Abstract

We consider estimation in a high-dimensional linear model with strongly correlated
variables. We propose to cluster the variables first and do subsequent sparse
estimation such as the Lasso for cluster-representatives or the group Lasso based
on the structure from the clusters. Regarding the first step, we present a novel
bottom-up agglomerative clustering algorithm based on canonical correlations,
and we show that it finds an optimal solution and is statistically consistent. We
also present some theoretical arguments that canonical correlation based clustering
leads to a better-posed compatibility constant for the design matrix which ensures
identifiability and an oracle inequality for the group Lasso. Furthermore, we discuss
circumstances where using cluster-representatives with the Lasso as subsequent
estimator leads to improved results for prediction and detection of variables. We
complement the theoretical analysis with various empirical results.

Keywords and phrases: Canonical correlation, group Lasso, Hierarchical cluster- 
ing, High-dimensional inference, Lasso, Oracle inequality, Variable screening, Variable 
selection. 



1 Introduction 

High-dimensional regression is nowadays used in many fields of application where the
number of covariables $p$ greatly exceeds the sample size $n$, i.e., $p \gg n$. We focus
here on the simple yet useful high-dimensional linear model

$$\mathbf{Y} = \mathbf{X}\beta^0 + \varepsilon, \tag{1}$$

with univariate $n \times 1$ response vector $\mathbf{Y}$, $n \times p$ design matrix $\mathbf{X}$, $p \times 1$ true
underlying coefficient vector $\beta^0$ and $n \times 1$ error vector $\varepsilon$. Our primary goal is to do
variable screening for the active set, i.e., the support of $\beta^0$, denoted by
$S_0 = \{j;\ \beta^0_j \neq 0,\ j = 1, \ldots, p\}$: we want to have a statistical procedure $\hat{S}$ such that
with high probability, $\hat{S} \supseteq S_0$ (and $|\hat{S}|$ not too large). In the case where $p \gg n$, the
obvious difficulties are due to (near) non-identifiability. While some positive results
have been shown under some assumptions on the design $\mathbf{X}$, see the paragraph below, high
empirical correlation between variables or near linear dependence among a few variables
remain notorious problems which are encountered in many applications. Examples include
genomics, where correlation and the degree of linear dependence are high within a group of
genes sharing the same biological pathway (Segal et al., 2003), or genome-wide association
studies, where SNPs are highly correlated or linearly dependent within segments of the
DNA sequence (Balding, 2007).



An important line of research to infer the active set $S_0$ or for variable screening has been
developed in the past using the Lasso (Tibshirani, 1996) or versions thereof (Zou, 2006;
Meinshausen, 2007; Zou and Li, 2008). Lasso-type methods have proven to be successful
in a range of practical problems. From a theoretical perspective, their properties
for variable selection and screening have been established assuming various conditions
on the design matrix $\mathbf{X}$, such as the neighborhood stability or irrepresentable condition
(Meinshausen and Bühlmann, 2006; Zhao and Yu, 2006), and various forms of
"restricted" eigenvalue conditions, see van de Geer (2007); Zhang and Huang (2008);
Meinshausen and Yu (2009); Bickel et al. (2009); van de Geer and Bühlmann (2009);
Sun and Zhang (2011). Despite these positive findings, situations where high empirical
correlations between covariates or near linear dependence among a few covariables
occur cannot be handled well with the Lasso: the Lasso tends to select only one variable
from the group of correlated or nearly linearly dependent variables, even if many or all of
these variables belong to the active set $S_0$. The elastic net (Zou and Hastie, 2005),
OSCAR (Bondell and Reich, 2008) and "clustered Lasso" (She, 2010) have been proposed
to address this problem but they do not explicitly take the correlation structure among the
variables into account and still exhibit difficulties when groups of variables are nearly
linearly dependent. A sparse Laplacian shrinkage estimator has been proposed (Huang
et al., 2011) and proven to select a correct set of variables under certain regularity
conditions. However, the sparse Laplacian shrinkage estimator is geared toward the
case where highly correlated variables have similar predictive effects (which we do not
require here) and its selection consistency theorem necessarily requires a uniform lower
bound for the nonzero signals above an inflated noise level due to model uncertainty.

We take here the point of view that we want to avoid false negatives, i.e., to avoid not
selecting an active variable from $S_0$: the price to pay for this is an increase in false
positive selections. From a practical point of view, it can be very useful to have a selection
method $\hat{S}$ which includes all variables from a group of nearly linearly dependent
variables where at least one of them is active. Such a procedure is often a good screening
method, when measured by $|\hat{S} \cap S_0|/|S_0|$ as a function of $|\hat{S}|$. The desired performance
can be achieved by clustering or grouping the variables first and then selecting whole
clusters instead of single variables.
1.1 Relation to other work and new contribution



The idea of clustering or grouping variables and then pursuing model fitting is not new,
of course. Principal component regression (cf. Kendall, 1957) is among the earliest
proposals, and Hastie et al. (2000) have used principal component analysis in order to
find a set of highly correlated variables where the clustering can be supervised by a
response variable. Tree-harvesting (Hastie et al., 2001) is another proposal which uses
supervised learning methods to select groups of predictive variables formed by hierarchical
clustering. An algorithmic approach, simultaneously performing the clustering
and supervised model fitting, was proposed by Dettling and Bühlmann (2004), and
the OSCAR method (Bondell and Reich, 2008) also does such simultaneous grouping and
supervised model fitting.

Our proposal differs from previous work in various aspects. We primarily propose to
use canonical correlation for clustering the variables as this reflects the notion of linear
dependence among variables: and it is exactly this notion of linear dependence which
causes the identifiability problems in the linear model in (1). Hence, this is conceptually
a natural strategy for clustering variables when the aim is to address identifiability
problems with variable selection in the linear model (1). We present in Section 2.1 an
agglomerative hierarchical clustering method using canonical correlation, and we prove
that it finds the finest clustering which satisfies the criterion that between-group
canonical correlations are smaller than a threshold, and that it is statistically consistent.
Furthermore, we prove in Section 4.1 that the construction of groups based on canonical
correlations leads to well-posed behavior of the group compatibility constant of the
design matrix $\mathbf{X}$ which ensures identifiability and an oracle inequality for the group
Lasso (Yuan and Lin, 2006). The latter is a natural choice for estimation and cluster
selection; another possibility is to use the Lasso for cluster representatives. We analyze
both of these methods: this represents a difference from earlier work, written at a time
when such high-dimensional estimation techniques were less established or not established
at all.

We present some supporting theory in Section 4 describing circumstances where clustering
and subsequent estimation improves over the standard Lasso without clustering, and
the derivations also show the limitations of such an approach. This sheds light on the
kinds of models and scenarios for which the two-stage approach commonly used in practice,
consisting of clustering variables first and subsequent estimation, is beneficial. Among the
favorable scenarios which we will examine for the latter approach are: (i) high within
cluster correlation and weak between cluster correlation with potentially many active
variables per cluster; and (ii) at most one active variable per cluster where the clusters
are tight (high within correlation) but not necessarily assuming low between cluster
correlation. Numerical results which complement the theoretical results are presented
in Section 5.
2 Clustering covariables

Consider the index set $\{1, \ldots, p\}$ for the covariables in (1). In the sequel, we denote by
$x^{(j)}$ the $j$th component of a vector $x$ and by $x^{(G)} = \{x^{(j)};\ j \in G\}$ the group of variables
from a cluster $G \subseteq \{1, \ldots, p\}$. The goal is to find a partition $\mathcal{G}$ into disjoint clusters
$G_1, \ldots, G_q$: $\mathcal{G} = \{G_1, \ldots, G_q\}$ with $\cup_{r=1}^q G_r = \{1, \ldots, p\}$ and $G_r \cap G_\ell = \emptyset$ $(r \neq \ell)$. The
partition $\mathcal{G}$ should then satisfy certain criteria.

We propose two methods for clustering the variables, i.e., finding a suitable partition: 
one is a novel approach based on canonical correlations while the other uses standard 
correlation based hierarchical clustering. 



2.1 Clustering using canonical correlations 

For a partition $\mathcal{G} = \{G_1, \ldots, G_q\}$ as above, we consider:

$$\hat\rho_{\max}(\mathcal{G}) = \max\{\hat\rho_{\mathrm{can}}(G_r, G_\ell);\ r, \ell \in \{1, \ldots, q\},\ r \neq \ell\}.$$

Here, $\hat\rho_{\mathrm{can}}(G_r, G_\ell)$ denotes the empirical canonical correlation (cf. Anderson, 1984)
between the variables from $\mathbf{X}^{(G_r)}$ and $\mathbf{X}^{(G_\ell)}$. (The empirical canonical correlation is
always non-negative.) A clustering with $\tau$-separation between clusters is defined as:

$$\hat{\mathcal{G}}(\tau) = \text{a partition } \hat{\mathcal{G}} \text{ of } \{1, \ldots, p\} \text{ such that } \hat\rho_{\max}(\hat{\mathcal{G}}) \le \tau \quad (0 < \tau < 1). \tag{2}$$

Not all values of $\tau$ are feasible: if $\tau$ is too small, then there is no partition which would
satisfy (2). For this reason, we define the canonical correlation of all the variables with
the empty set of variables as zero: hence, the trivial partition $\mathcal{G}_{\mathrm{single}}$ consisting of the
single cluster $\{1, \ldots, p\}$ has $\hat\rho_{\max}(\mathcal{G}_{\mathrm{single}}) = 0$, which satisfies (2). The fact that $\tau$ may not
be feasible (except with $\mathcal{G}_{\mathrm{single}}$) can be understood from the viewpoint that coarse
partitions do not necessarily lead to smaller values of $\hat\rho_{\max}$: for example, when $p \gg n$ and if
$\mathrm{rank}(\mathbf{X}^{(G_r \cup G_\ell)}) = n$, which would typically happen if $|G_r \cup G_\ell| > n$, then
$\hat\rho_{\mathrm{can}}(G_r, G_\ell) = 1$. In general, clustering with $\tau$-separation does not have a unique solution.
For example, if $\hat{\mathcal{G}}(\tau) = \{G_1, \ldots, G_q\}$ is a clustering with $\tau$-separation and
$\{G_{r;k},\ k = 1, \ldots, q_r\}$ is a nontrivial partition of $G_r$ with
$\max_{1 \le k_1 < k_2 \le q_r} \hat\rho_{\mathrm{can}}(G_{r;k_1}, G_{r;k_2}) \le \tau$,
then $\{G_1, \ldots, G_{r-1},\ G_{r;k}, k = 1, \ldots, q_r,\ G_{r+1}, \ldots, G_q\}$ is a strictly finer clustering with
$\tau$-separation, see also Lemma 7.1 below. The non-uniqueness of clustering with
$\tau$-separation motivates the following definition of the finest clustering with $\tau$-separation.
A clustering with $\tau$-separation between clusters, say $\hat{\mathcal{G}}(\tau)$, is finest if every other
clustering with $\tau$-separation is strictly coarser than $\hat{\mathcal{G}}(\tau)$, and we denote such a finest
clustering with $\tau$-separation by

$$\hat{\mathcal{G}}_{\mathrm{finest}}(\tau).$$

The existence and uniqueness of the finest clustering with $\tau$-separation are provided in
Theorem 2.1 below.
A simple hierarchical bottom-up agglomerative clustering (without the need to define
linkage between clusters) can be used as an estimator $\hat{\mathcal{G}}(\tau)$ which satisfies (2) and which
is finest: the procedure is described in Algorithm 1.

Theorem 2.1 The hierarchical bottom-up agglomerative clustering Algorithm 1 leads
to a partition $\hat{\mathcal{G}}(\tau)$ which satisfies (2). If $\tau$ is not feasible with a nontrivial partition,
the solution is the coarsest partition $\hat{\mathcal{G}}(\tau) = \mathcal{G}_{\mathrm{single}} = \{G\}$ with $G = \{1, \ldots, p\}$.

Furthermore, if $\tau$ is feasible with a nontrivial partition, the solution $\hat{\mathcal{G}}(\tau) = \hat{\mathcal{G}}_{\mathrm{finest}}(\tau)$ is
the finest clustering with $\tau$-separation.

A proof is given in Section 7. Theorem 2.1 states that a bottom-up greedy strategy
leads to an optimal solution.

We now present consistency of the clustering Algorithm 1. Denote the population
canonical correlation between $X^{(G_r)}$ and $X^{(G_\ell)}$ by $\rho_{\mathrm{can}}(G_r, G_\ell)$, and the maximum
population canonical correlation by $\rho_{\max}(\mathcal{G}) = \max_{r \neq \ell} \rho_{\mathrm{can}}(G_r, G_\ell)$, respectively, for a
partition $\mathcal{G} = \{G_1, \ldots, G_q\}$ of $\{1, \ldots, p\}$. As in (2), a partition $\mathcal{G}(\tau)$ is a population clustering
with $\tau$-separation if $\rho_{\max}(\mathcal{G}) \le \tau$. The finest population clustering with $\tau$-separation,
denoted by $\mathcal{G}_{\mathrm{finest}}(\tau)$, is the one which is finer than any other population clustering
with $\tau$-separation. With the convention $\rho_{\max}(\mathcal{G}_{\mathrm{single}}) = 0$, the existence and uniqueness
of the finest population clustering with $\tau$-separation follows from Theorem 2.1. In
fact, the hierarchical bottom-up agglomerative clustering Algorithm 1 yields the finest
population clustering $\mathcal{G}_{\mathrm{finest}}(\tau)$ with $\tau$-separation if the population canonical correlation
is available and used in the algorithm.

Let $\mathcal{G}_{\mathrm{finest}}(\tau)$ be the finest population clustering with $\tau$-separation and $\hat{\mathcal{G}}(\tau)$ be the
sample clustering with $\tau$-separation generated by the hierarchical bottom-up agglomerative
clustering Algorithm 1 based on the design matrix $\mathbf{X}$. The following theorem provides
a sufficient condition for the consistency of $\hat{\mathcal{G}}(\tau)$ in the Gaussian model

$$X_1, \ldots, X_n \ \text{i.i.d.} \sim \mathcal{N}_p(0, \Sigma). \tag{3}$$

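For any given $t > 0$ and positive integers $q$ and $d_1, \ldots, d_q$, define

$$t_r = \sqrt{d_r/n} + \sqrt{(2/n)\,(t + \log(q(q+1)))}, \qquad \Delta^*_{r,\ell} = \frac{3(t_r \wedge t_\ell) + (t_r \vee t_\ell)}{(1 - t_r)(1 - t_\ell)}. \tag{4}$$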



Algorithm 1 Bottom-up, agglomerative hierarchical clustering using canonical correlations.

1: Start with the single variables as p clusters (nodes at the bottom of a tree). Set
   b = 0.
2: repeat
3:   Increase b by one. Merge the two clusters having highest canonical correlation.
4: until Criterion (2) is satisfied.
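
The steps of Algorithm 1 can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names `canonical_correlation` and `cluster_canonical`, the centering of the columns, and the rank tolerance `tol` are assumptions made here. The largest empirical canonical correlation between two column groups is computed as the largest singular value of $Q_1^T Q_2$, where $Q_1$ and $Q_2$ are orthonormal bases of the two (centered) column spans.

```python
import numpy as np

def canonical_correlation(X, g1, g2, tol=1e-10):
    """Largest empirical canonical correlation between column groups g1 and g2."""
    def basis(cols):
        # Orthonormal basis of the centered column span (centering is an assumption).
        A = X[:, cols] - X[:, cols].mean(axis=0)
        U, s, _ = np.linalg.svd(A, full_matrices=False)
        return U[:, s > tol * max(s.max(), 1.0)]
    Q1, Q2 = basis(g1), basis(g2)
    if Q1.shape[1] == 0 or Q2.shape[1] == 0:
        return 0.0
    # The singular values of Q1^T Q2 are the canonical correlations.
    top = np.linalg.svd(Q1.T @ Q2, compute_uv=False)[0]
    return float(min(1.0, top))

def cluster_canonical(X, tau):
    """Merge the most correlated pair of clusters until criterion (2) holds."""
    clusters = [[j] for j in range(X.shape[1])]
    while len(clusters) > 1:
        pairs = [(canonical_correlation(X, clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        rho_max, a, b = max(pairs)
        if rho_max <= tau:          # criterion (2) is satisfied: stop
            break
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return [sorted(c) for c in clusters]

# Hypothetical toy design: two tight pairs built from orthogonal sign patterns.
a = np.array([1, -1, 1, -1, 1, -1, 1, -1.0])
b = np.array([1, 1, -1, -1, 1, 1, -1, -1.0])
c = np.array([1, 1, 1, 1, -1, -1, -1, -1.0])
X_toy = np.column_stack([a, a + 0.1 * a * b, b, b + 0.1 * a * c])
toy_clusters = cluster_canonical(X_toy, tau=0.5)
```

If no nontrivial partition satisfies (2), the loop ends with the single cluster, in line with the convention that its maximal canonical correlation is zero. The pairwise recomputation in every iteration is deliberately naive and only suitable for small p.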
Theorem 2.2 Consider $\mathbf{X}$ from (3) and $\mathcal{G}^0 = \{G_1, \ldots, G_q\}$ a partition of $\{1, \ldots, p\}$.
Let $t > 0$ and $d_r = \mathrm{rank}(\Sigma_{G_r,G_r})$. Define $\Delta^*_{r,\ell}$ by (4). Suppose

$$\max_{1 \le r < \ell \le q} \{\rho_{\mathrm{can}}(G_r, G_\ell) + \Delta^*_{r,\ell}\} \le \tau_- \le \tau_+ < \min_{1 \le r \le q}\ \min_{\{G_{r;k}\}} \Big\{ \max_{k_1 < k_2} \rho_{\mathrm{can}}(G_{r;k_1}, G_{r;k_2}) - \Delta^*_{r,r} \Big\}, \tag{5}$$

where $\min_{\{G_{r;k}\}}$ is taken over all nontrivial partitions $\{G_{r;k},\ k \le q_r\}$ of $G_r$. Then,
$\mathcal{G}^0 = \mathcal{G}_{\mathrm{finest}}(\tau)$ is the finest population clustering with $\tau$-separation for all $\tau_- \le \tau \le \tau_+$,
and

$$P[\hat{\mathcal{G}}(\tau) = \mathcal{G}_{\mathrm{finest}}(\tau),\ \forall\, \tau_- \le \tau \le \tau_+] \ge 1 - \exp(-t).$$

A proof is given in Section 7. We note that $t = \log(p)$ leads to
$t_r \asymp \sqrt{\mathrm{rank}(\Sigma_{G_r,G_r})/n} + \sqrt{\log(p)/n}$: this is small if $\mathrm{rank}(\Sigma_{G_r,G_r}) = o(n)$ and
$\log(p) = o(n)$, and then $\Delta^*_{r,\ell}$ is small as well (which means that the probability bound
becomes $1 - p^{-1} \to 1$ as $p \to \infty$ (or $p \ge n \to \infty$)).

The parameter $\tau$ in (2) needs to be chosen. We advocate the use of the minimal
resulting $\tau$. This can be easily implemented: we run the bottom-up agglomerative
clustering Algorithm 1 and record in every iteration the maximal canonical correlation
between clusters, denoted by $\hat\rho_{\max}(b)$ where $b$ is the iteration number. We then use the
partition corresponding to the iteration

$$\hat{b} = \arg\min_b \hat\rho_{\max}(b). \tag{6}$$

A typical path of $\hat\rho_{\max}(b)$ as a function of the iterations $b$, the choice $\hat{b}$ and the
corresponding minimal $\hat\rho_{\max}(\hat{b})$ are shown in Figure 1.

We conclude that the hierarchical bottom-up agglomerative clustering Algorithm 1 with
the rule in (6) is fully data-driven, and there is no need to define linkage between clusters.
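
Rule (6) amounts to an argmin over the recorded path. The following minimal sketch assumes the path is stored as a Python list indexed by the iteration number b, and that the final trivial single-cluster partition (whose value is zero by convention) is not part of the list; both the helper name `select_iteration` and the example values are hypothetical.

```python
def select_iteration(rho_max_path):
    """Return (b_hat, rho_max(b_hat)) for a recorded path rho_max_path[b]."""
    b_hat = min(range(len(rho_max_path)), key=lambda b: rho_max_path[b])
    return b_hat, rho_max_path[b_hat]

# Hypothetical path of maximal between-cluster canonical correlations.
example_path = [0.97, 0.93, 0.81, 0.42, 0.55, 0.78, 1.0]
b_hat, rho_hat = select_iteration(example_path)
```

In case of ties, `min` returns the earliest iteration, i.e. the finer of the tied partitions.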



2.2 Ordinary hierarchical clustering 



As an alternative to the clustering method in Section 2.1, we consider in Section 5.2
ordinary hierarchical agglomerative clustering based on the dissimilarity matrix $D$ with
entries $D_{r,\ell} = 1 - |\hat\rho(\mathbf{X}^{(r)}, \mathbf{X}^{(\ell)})|$, where $\hat\rho(\mathbf{X}^{(r)}, \mathbf{X}^{(\ell)})$ denotes the sample correlation
between $\mathbf{X}^{(r)}$ and $\mathbf{X}^{(\ell)}$. We choose average-linkage as dissimilarity between clusters.

As a cutoff for determining the number of clusters, we proceed according to an established
principle. In every clustering iteration $b$, proceeding in an agglomerative way, we
record the new value $h_b$ of the corresponding linkage function from the merged clusters
(in iteration $b$): we then use the partition corresponding to the iteration

$$\hat{b} = \arg\max_b\,(h_{b+1} - h_b).$$
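
A naive sketch of this procedure is given below: average-linkage clustering on $D = 1 - |\hat\rho|$, followed by the largest-gap rule for the linkage heights. The function name `average_linkage_gap` is an assumption made here, and the O(p^3) pairwise search is for illustration only; a dedicated clustering library would normally be used.

```python
import numpy as np

def average_linkage_gap(X):
    """Average-linkage clustering on 1 - |correlation| with the largest-gap cutoff."""
    p = X.shape[1]
    D = 1.0 - np.abs(np.corrcoef(X, rowvar=False))
    clusters = [[j] for j in range(p)]
    heights, partitions = [], []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean dissimilarity over all cross pairs.
                h = D[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or h < best[0]:
                    best = (h, a, b)
        h, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
        heights.append(float(h))                      # h_b for iteration b
        partitions.append([sorted(c) for c in clusters])
    gaps = np.diff(heights)                           # h_{b+1} - h_b
    i_hat = int(np.argmax(gaps))                      # 0-based index of b_hat
    return partitions[i_hat], heights

# Hypothetical toy design with two tight pairs of variables (needs p >= 3).
a = np.array([1, -1, 1, -1, 1, -1, 1, -1.0])
b = np.array([1, 1, -1, -1, 1, 1, -1, -1.0])
c = np.array([1, 1, 1, 1, -1, -1, -1, -1.0])
X_toy = np.column_stack([a, a + 0.1 * a * b, b, b + 0.1 * a * c])
partition, heights = average_linkage_gap(X_toy)
```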
Figure 1: Path of $\hat\rho_{\max}(b)$ as a function of the iterations $b$: from real data described in
Section 5.2 with p = 1000 and n = 71.



3 Supervised selection of clusters 



From the design matrix $\mathbf{X}$, we infer the clusters $G_1, \ldots, G_q$ as described in Section 2.
We select the variables in the linear model (1) in a group-wise fashion where all variables
from a cluster $G_r$ $(r = 1, \ldots, q)$ are selected or not: this is denoted by

$$\hat{S}_{\mathrm{cluster}} = \{r;\ \text{cluster } G_r \text{ is selected},\ r = 1, \ldots, q\}.$$

The selected set of variables is then the union of the selected clusters:

$$\hat{S} = \bigcup_{r \in \hat{S}_{\mathrm{cluster}}} G_r. \tag{7}$$

We propose two methods for selecting the clusters, i.e., two different estimators $\hat{S}_{\mathrm{cluster}}$.



3.1 Cluster representative Lasso (CRL) 

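For each cluster we consider the representative

$$\bar{\mathbf{X}}^{(r)} = |G_r|^{-1} \sum_{j \in G_r} \mathbf{X}^{(j)}, \quad r = 1, \ldots, q,$$

where $\mathbf{X}^{(j)}$ denotes the $j$th $n \times 1$ column-vector of $\mathbf{X}$. Denote by $\bar{\mathbf{X}}$ the $n \times q$ design
matrix whose $r$th column is given by $\bar{\mathbf{X}}^{(r)}$. We use the Lasso based on the response $\mathbf{Y}$
and the design matrix $\bar{\mathbf{X}}$:

$$\hat\beta_{\mathrm{CRL}} = \arg\min_\beta \big(\|\mathbf{Y} - \bar{\mathbf{X}}\beta\|_2^2/n + \lambda_{\mathrm{CRL}}\|\beta\|_1\big).$$

The selected clusters are then given by

$$\hat{S}_{\mathrm{cluster,CRL}} = \hat{S}_{\mathrm{cluster,CRL}}(\lambda_{\mathrm{CRL}}) = \{r;\ \hat\beta_{\mathrm{CRL},r}(\lambda_{\mathrm{CRL}}) \neq 0,\ r = 1, \ldots, q\},$$

and the selected variables are obtained as in (7), denoted as

$$\hat{S}_{\mathrm{CRL}} = \hat{S}_{\mathrm{CRL}}(\lambda_{\mathrm{CRL}}) = \bigcup_{r \in \hat{S}_{\mathrm{cluster,CRL}}} G_r.$$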
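
The CRL pipeline can be sketched as follows. The coordinate-descent solver `lasso_cd` is a plain textbook routine written here for the objective $\|\mathbf{Y} - \bar{\mathbf{X}}\beta\|_2^2/n + \lambda\|\beta\|_1$; it is an assumption of this sketch, not the solver used by the authors, and the helper names and the fixed number of sweeps are hypothetical.

```python
import numpy as np

def cluster_representatives(X, clusters):
    """n x q matrix whose r-th column is the average of the columns in cluster r."""
    return np.column_stack([X[:, g].mean(axis=1) for g in clusters])

def lasso_cd(Z, y, lam, n_iter=500):
    """Minimise ||y - Z b||_2^2 / n + lam * ||b||_1 by cyclic coordinate descent."""
    n, q = Z.shape
    b = np.zeros(q)
    col_sq = (Z ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for r in range(q):
            if col_sq[r] == 0.0:
                continue
            partial_resid = y - Z @ b + Z[:, r] * b[r]
            c = Z[:, r] @ partial_resid / n
            # Soft-thresholding at lam / 2 (the squared loss carries no factor 1/2).
            b[r] = np.sign(c) * max(abs(c) - lam / 2.0, 0.0) / col_sq[r]
    return b

def crl_select(X, y, clusters, lam):
    """Selected clusters and the union of their variables, as in (7)."""
    coef = lasso_cd(cluster_representatives(X, clusters), y, lam)
    sel_clusters = [r for r in range(len(clusters)) if coef[r] != 0.0]
    sel_vars = sorted(j for r in sel_clusters for j in clusters[r])
    return sel_clusters, sel_vars

# Hypothetical toy example: only the first cluster carries signal.
a = np.array([1, -1, 1, -1, 1, -1, 1, -1.0])
b = np.array([1, 1, -1, -1, 1, 1, -1, -1.0])
c = np.array([1, 1, 1, 1, -1, -1, -1, -1.0])
X_toy = np.column_stack([a, a + 0.1 * a * b, b, b + 0.1 * a * c])
y_toy = X_toy[:, 0] + X_toy[:, 1]
selected = crl_select(X_toy, y_toy, [[0, 1], [2, 3]], lam=0.1)
```

Both variables of the first cluster are returned even though the Lasso sees a single averaged column, which is the screening behaviour the section aims for.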

3.2 Cluster group Lasso (CGL) 

Another obvious way to select clusters is given by the group Lasso. We partition the
vector of regression coefficients according to the clusters: $\beta = (\beta_{G_1}, \ldots, \beta_{G_q})^T$, where
$\beta_{G_r} = (\{\beta_j;\ j \in G_r\})^T$. The cluster group Lasso is defined as

$$\hat\beta_{\mathrm{CGL}} = \arg\min_\beta \Big(\|\mathbf{Y} - \mathbf{X}\beta\|_2^2/n + \lambda_{\mathrm{CGL}} \sum_{r=1}^q w_r \|\mathbf{X}^{(G_r)}\beta_{G_r}\|_2\, n^{-1/2}\Big), \tag{8}$$

where $w_r$ is a multiplier, typically pre-specified as $w_r = \sqrt{|G_r|}$. It is well known that
the group Lasso enjoys a group selection property where either $\hat\beta_{\mathrm{CGL},G_r} \neq 0$ (all
components are different from zero) or $\hat\beta_{\mathrm{CGL},G_r} \equiv 0$ (the zero-vector). We note that the
estimator in (8) is different from using the usual penalty $\lambda \sum_{r=1}^q w_r \|\beta_{G_r}\|_2$: the penalty
in (8) is termed "groupwise prediction penalty" in Bühlmann and van de Geer (2011,
Sec. 4.5.1): it has nice parameterization invariance properties, and it is a much more
appropriate penalty when $\mathbf{X}^{(G_r)}$ exhibits strongly correlated columns.

The selected clusters are then given by

$$\hat{S}_{\mathrm{cluster,CGL}} = \hat{S}_{\mathrm{cluster,CGL}}(\lambda_{\mathrm{CGL}}) = \{r;\ \hat\beta_{\mathrm{CGL},G_r}(\lambda_{\mathrm{CGL}}) \neq 0,\ r = 1, \ldots, q\},$$

and the selected variables are as in (7):

$$\hat{S}_{\mathrm{CGL}} = \hat{S}_{\mathrm{CGL}}(\lambda_{\mathrm{CGL}}) = \bigcup_{r \in \hat{S}_{\mathrm{cluster,CGL}}} G_r = \{j;\ \hat\beta_{\mathrm{CGL},j} \neq 0,\ j = 1, \ldots, p\},$$

where the latter equality follows from the group selection property of the group Lasso.
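
The parameterization invariance of the groupwise prediction penalty in (8) can be illustrated numerically. In the sketch below (function names, the toy matrix and the rescaling matrix `T` are all hypothetical), the columns of one group are transformed by an invertible matrix and the coefficients are transformed accordingly, so the fitted values within the group stay the same: the penalty in (8) is unchanged, whereas the usual group Lasso penalty is not.

```python
import numpy as np

def groupwise_prediction_penalty(X, beta, clusters, lam):
    """lam * sum_r sqrt(|G_r|) * ||X^(G_r) beta_(G_r)||_2 / sqrt(n), as in (8)."""
    n = X.shape[0]
    return lam * sum(np.sqrt(len(g)) * np.linalg.norm(X[:, g] @ beta[g]) / np.sqrt(n)
                     for g in clusters)

def usual_group_penalty(beta, clusters, lam):
    """lam * sum_r sqrt(|G_r|) * ||beta_(G_r)||_2."""
    return lam * sum(np.sqrt(len(g)) * np.linalg.norm(beta[g]) for g in clusters)

# Hypothetical toy data with clusters {0, 1} and {2}.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [2.0, 1.0, 1.0]])
beta = np.array([1.0, 2.0, -1.0])
clusters = [[0, 1], [2]]

# Reparameterize the first group: X_G -> X_G T, beta_G -> T^{-1} beta_G.
T = np.array([[2.0, 0.0],
              [0.0, 1.0]])
X_new, beta_new = X.copy(), beta.copy()
X_new[:, [0, 1]] = X[:, [0, 1]] @ T
beta_new[[0, 1]] = np.linalg.solve(T, beta[[0, 1]])

pred_before = groupwise_prediction_penalty(X, beta, clusters, lam=1.0)
pred_after = groupwise_prediction_penalty(X_new, beta_new, clusters, lam=1.0)
usual_before = usual_group_penalty(beta, clusters, lam=1.0)
usual_after = usual_group_penalty(beta_new, clusters, lam=1.0)
```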



4 Theoretical results for cluster Lasso methods 

We provide here some supporting theory, first for the cluster group Lasso (CGL) and 
then for the cluster representative Lasso (CRL). 

4.1 Cluster group Lasso (CGL) 

We will show first that the compatibility constant of the design matrix $\mathbf{X}$ is well-behaved
if the canonical correlation between groups is small, i.e., a situation which the clustering
Algorithm 1 is exactly aiming for.
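The CGL method is based on the whole design matrix $\mathbf{X}$, and we can write the model
in group structure form where we denote by $\mathbf{X}^{(G_r)}$ the $n \times |G_r|$ design matrix with
variables $\{\mathbf{X}^{(j)};\ j \in G_r\}$:

$$\mathbf{Y} = \mathbf{X}\beta^0 + \varepsilon = \sum_{r=1}^q \mathbf{X}^{(G_r)}\beta^0_{G_r} + \varepsilon, \tag{9}$$

where $\beta^0_{G_r} = (\{\beta^0_j;\ j \in G_r\})^T$. We denote by $S_{0,\mathrm{Group}} = \{r;\ \beta^0_{G_r} \neq 0,\ r = 1, \ldots, q\}$.

We introduce now a few other notations and definitions. We let $m_r := |G_r|$, $r = 1, \ldots, q$,
denote the group sizes, and define the average group size

$$\bar{m} = \frac{1}{q}\sum_{r=1}^q m_r,$$

and the average size of active groups

$$\bar{m}_{S_{0,\mathrm{Group}}} = \frac{1}{|S_{0,\mathrm{Group}}|}\sum_{r \in S_{0,\mathrm{Group}}} m_r.$$

Furthermore, for any $S \subseteq \{1, \ldots, q\}$, we let $\mathbf{X}^{(S)}$ be the design matrix containing the
variables in $\cup_{r \in S} G_r$. Moreover, define

$$\|\beta_S\|_{2,1} := \sum_{r \in S} \|\mathbf{X}^{(G_r)}\beta_{G_r}\|_2 \sqrt{m_r/\bar{m}}.$$

We denote in this section by

$$\hat\Sigma_{r,\ell} := (\mathbf{X}^{(G_r)})^T \mathbf{X}^{(G_\ell)}/n, \quad r, \ell \in \{1, \ldots, q\}.$$

We assume that each $\hat\Sigma_{r,r}$ is non-singular (otherwise one may use generalized inverses)
and we write, assuming here for notational clarity that $S_{0,\mathrm{Group}} = \{1, \ldots, s_0\}$
($s_0 = |S_{0,\mathrm{Group}}|$) is the set of indices of the first $s_0$ groups:

$$\hat{R}_{S_{0,\mathrm{Group}}} = \begin{pmatrix}
I & \hat\Sigma_{1,1}^{-1/2}\hat\Sigma_{1,2}\hat\Sigma_{2,2}^{-1/2} & \cdots & \hat\Sigma_{1,1}^{-1/2}\hat\Sigma_{1,s_0}\hat\Sigma_{s_0,s_0}^{-1/2} \\
\hat\Sigma_{2,2}^{-1/2}\hat\Sigma_{2,1}\hat\Sigma_{1,1}^{-1/2} & I & \cdots & \hat\Sigma_{2,2}^{-1/2}\hat\Sigma_{2,s_0}\hat\Sigma_{s_0,s_0}^{-1/2} \\
\vdots & \vdots & \ddots & \vdots \\
\hat\Sigma_{s_0,s_0}^{-1/2}\hat\Sigma_{s_0,1}\hat\Sigma_{1,1}^{-1/2} & \hat\Sigma_{s_0,s_0}^{-1/2}\hat\Sigma_{s_0,2}\hat\Sigma_{2,2}^{-1/2} & \cdots & I
\end{pmatrix}.$$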



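The group compatibility constant given in Bühlmann and van de Geer (2011) is a value
$\phi_{0,\mathrm{Group}}(\mathbf{X})$ that satisfies

$$\phi^2_{0,\mathrm{Group}}(\mathbf{X}) \le \min\left\{ \frac{\bar{m}\,|S_{0,\mathrm{Group}}|\,\|\mathbf{X}\beta\|_2^2}{\|\beta_{S_{0,\mathrm{Group}}}\|_{2,1}^2};\ \ \|\beta_{S^c_{0,\mathrm{Group}}}\|_{2,1} \le 3\,\|\beta_{S_{0,\mathrm{Group}}}\|_{2,1} \right\}.$$

The constant 3 here is related to the condition $\lambda \ge 2\lambda_0$ in Proposition 4.1 below: in
general it can be taken as $(\lambda + \lambda_0)/(\lambda - \lambda_0)$.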
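Theorem 4.1 Suppose that $\hat{R}_{S_{0,\mathrm{Group}}}$ has smallest eigenvalue $\hat\Lambda^2_{\min} > 0$. Assume
moreover the incoherence conditions:

$$\rho := \max_{r \in S_{0,\mathrm{Group}},\ \ell \notin S_{0,\mathrm{Group}}} \frac{\bar{m}}{\sqrt{m_r m_\ell}}\,\hat\rho_{\mathrm{can}}(G_r, G_\ell) \le C\,\frac{\hat\Lambda^2_{\min}\,\bar{m}/\bar{m}_{S_{0,\mathrm{Group}}}}{3\,|S_{0,\mathrm{Group}}|} \quad \text{for some } 0 < C < 1,$$

$$\rho_{S_{0,\mathrm{Group}}} := \max_{r,\ell \in S_{0,\mathrm{Group}},\ r \neq \ell} \frac{\bar{m}}{\sqrt{m_r m_\ell}}\,\hat\rho_{\mathrm{can}}(G_r, G_\ell) < \frac{1}{|S_{0,\mathrm{Group}}|}.$$

Then, the group Lasso compatibility holds with compatibility constant

$$\phi^2_{0,\mathrm{Group}}(\mathbf{X}) \ge \frac{\big(\hat\Lambda^2_{\min}\,\bar{m}/\bar{m}_{S_{0,\mathrm{Group}}} - 3\,|S_{0,\mathrm{Group}}|\,\rho\big)^2}{\hat\Lambda^2_{\min}\,\bar{m}^2/\bar{m}^2_{S_{0,\mathrm{Group}}}} \ge (1 - C)^2\,\hat\Lambda^2_{\min} \ge (1 - C)^2\,(1 - |S_{0,\mathrm{Group}}|\,\rho_{S_{0,\mathrm{Group}}}) > 0.$$

A proof is given in Section 7. We note that small enough canonical correlations between
groups ensure the incoherence assumptions for $\rho$ and $\rho_{S_{0,\mathrm{Group}}}$, and in turn that the
group Lasso compatibility condition holds. The canonical correlation based clustering
Algorithm 1 is tailored for this situation.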



Remark 1. One may in fact prove a group version of Corollary 7.2 in Bühlmann and
van de Geer (2011) which says that under the eigenvalue and incoherence condition of
Theorem 4.1, a group irrepresentable condition holds. This in turn implies that, with
large probability, the group Lasso will have no false positive selection of groups, i.e., one
has a group version of the result of Problem 7.5 in Bühlmann and van de Geer (2011).

Known theoretical results can then be applied for proving an oracle inequality of the
group Lasso.

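Proposition 4.1 Consider model (1) with fixed design $\mathbf{X}$ and Gaussian error
$\varepsilon \sim \mathcal{N}_n(0, \sigma^2 I)$. Let

$$\lambda_0 = \sigma\,\frac{2}{\sqrt{n}}\sqrt{1 + \sqrt{\frac{4t + 4\log(p)}{m_{\min}}} + \frac{4t + 4\log(p)}{m_{\min}}},$$

where $m_{\min} = \min_{r=1,\ldots,q} m_r$. Assume that $\hat\Sigma_{r,r}$ is non-singular, for all $r = 1, \ldots, q$.
Then, for the group Lasso estimator $\hat\beta(\lambda)$ in (8) with $\lambda \ge 2\lambda_0$, and with probability at
least $1 - \exp(-t)$:

$$\|\mathbf{X}(\hat\beta(\lambda) - \beta^0)\|_2^2/n + \lambda \sum_{r=1}^q \sqrt{m_r}\,\|\hat\beta_{G_r}(\lambda) - \beta^0_{G_r}\|_2 \le 24\,\lambda^2 \sum_{r \in S_{0,\mathrm{Group}}} m_r \big/ \phi^2_{0,\mathrm{Group}}(\mathbf{X}), \tag{10}$$

where $\phi^2_{0,\mathrm{Group}}(\mathbf{X})$ denotes the group compatibility constant.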






Proof: We can invoke the result in Bühlmann and van de Geer (2011, Th. 8.1): using
the groupwise prediction penalty in (8) leads to an equivalent formalization where we
can normalize to $\hat\Sigma_{r,r} = I_{m_r \times m_r}$. The requirement $\lambda \ge 4\lambda_0$ in Bühlmann and van de
Geer (2011, Th. 8.1) can be relaxed to $\lambda \ge 2\lambda_0$ since the model (1) is assumed to be
true, see also Bühlmann and van de Geer (2011, p. 108, Sec. 6.2.3).

□



4.2 Linear dimension reduction and subsequent Lasso estimation 

For a mean zero Gaussian random variable $Y \in \mathbb{R}$ and a mean zero Gaussian random
vector $X \in \mathbb{R}^p$, we can always use a random design Gaussian linear model representation:

$$Y = \sum_{j=1}^p \beta^0_j X^{(j)} + \varepsilon, \qquad X \sim \mathcal{N}_p(0, \Sigma),\ \ \varepsilon \sim \mathcal{N}(0, \sigma^2), \tag{11}$$

where $\varepsilon$ is independent of $X$.
Consider a linear dimension reduction

$$Z = A_{q \times p} X$$

using a matrix $A_{q \times p}$ with $q < p$. We denote in the sequel by

$$\mu_X = E[Y|X] = \sum_{j=1}^p \beta^0_j X^{(j)}, \qquad \mu_Z = E[Y|Z] = \sum_{r=1}^q \gamma^0_r Z^{(r)}.$$

Of particular interest is $Z = (\bar{X}^{(1)}, \ldots, \bar{X}^{(q)})^T$, corresponding to the cluster
representatives $\bar{X}^{(r)} = |G_r|^{-1}\sum_{j \in G_r} X^{(j)}$. Due to the Gaussian assumption in (11), we can
always represent

$$Y = \sum_{r=1}^q \gamma^0_r Z^{(r)} + \eta, \tag{12}$$

where $\eta$ is independent of $Z$. Furthermore, since $\mu_Z$ is the linear projection of $Y$ on the
linear span of $Z$ and also the linear projection of $\mu_X$ on the linear span of $Z$,

$$\xi^2 = \mathrm{Var}(\eta) = \mathrm{Var}(\varepsilon + \mu_X - \mu_Z) = \sigma^2 + E[(\mu_X - \mu_Z)^2].$$

For the prediction error, when using the dimension-reduced $Z$ and for any estimator $\hat\gamma$,
we have, using that $\mu_X - \mu_Z$ is orthogonal to the linear span of $Z$:

$$E[(X^T\beta^0 - Z^T\hat\gamma)^2] = E[(Z^T\gamma^0 - Z^T\hat\gamma)^2] + E[(\mu_X - \mu_Z)^2]. \tag{13}$$

Thus, the total prediction error consists of an error due to estimation of $\gamma^0$ and a
squared bias term $B^2 = E[(\mu_X - \mu_Z)^2]$. The latter has already appeared in the variance
$\xi^2 = \mathrm{Var}(\eta)$ above.
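
The decomposition (13) can be checked at the population level without simulation. The sketch below assumes that $\hat\gamma$ is treated as a fixed vector (the expectation is over a new draw of $X$ only); the covariance matrix, coefficients and function names are hypothetical choices made for illustration.

```python
import numpy as np

def population_gamma(Sigma, beta, A):
    """gamma^0 for Z = A X: coefficients of the linear projection of Y on Z."""
    return np.linalg.solve(A @ Sigma @ A.T, A @ Sigma @ beta)

def squared_bias(Sigma, beta, A):
    """B^2 = E[(mu_X - mu_Z)^2], using mu_X - mu_Z = X^T (beta - A^T gamma^0)."""
    d = beta - A.T @ population_gamma(Sigma, beta, A)
    return float(d @ Sigma @ d)

def prediction_error(Sigma, beta, A, gamma_hat):
    """E[(X^T beta - Z^T gamma_hat)^2] for a fixed gamma_hat."""
    d = beta - A.T @ gamma_hat
    return float(d @ Sigma @ d)

def estimation_error(Sigma, A, gamma0, gamma_hat):
    """E[(Z^T gamma^0 - Z^T gamma_hat)^2] for a fixed gamma_hat."""
    e = gamma0 - gamma_hat
    return float(e @ (A @ Sigma @ A.T) @ e)

# Hypothetical example: p = 4 variables, q = 2 cluster representatives.
Sigma = np.array([[1.0, 0.8, 0.1, 0.0],
                  [0.8, 1.0, 0.0, 0.1],
                  [0.1, 0.0, 1.0, 0.6],
                  [0.0, 0.1, 0.6, 1.0]])
A = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5]])
beta = np.array([1.0, -1.0, 2.0, 0.5])
gamma_hat = np.array([0.3, 2.0])

gamma0 = population_gamma(Sigma, beta, A)
lhs = prediction_error(Sigma, beta, A, gamma_hat)
rhs = estimation_error(Sigma, A, gamma0, gamma_hat) + squared_bias(Sigma, beta, A)
```

The two sides agree because the cross term vanishes: $A\Sigma(\beta - A^T\gamma^0) = 0$ by the definition of $\gamma^0$.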
Let $\mathbf{Y}$, $\mathbf{X}$ and $\mathbf{Z}$ be $n$ i.i.d. realizations of the variables $Y$, $X$ and $Z$, respectively. Then,

$$\mathbf{Y} = \mathbf{X}\beta^0 + \varepsilon = \mathbf{Z}\gamma^0 + \eta.$$

Consider the Lasso, applied to $\mathbf{Y}$ and $\mathbf{Z}$, for estimating $\gamma^0$:

$$\hat\gamma = \arg\min_\gamma \big(\|\mathbf{Y} - \mathbf{Z}\gamma\|_2^2/n + \lambda\|\gamma\|_1\big).$$

The cluster representative Lasso (CRL) is a special case with $n \times q$ design matrix $\mathbf{Z} = \bar{\mathbf{X}}$.

Proposition 4.2 Consider $n$ i.i.d. realizations from the model (11). Let

$$\lambda_0 = 2\,\|\hat\sigma_Z\|_\infty\,\xi\,\sqrt{\frac{t^2 + 2\log(q)}{n}},$$

where $\|\hat\sigma_Z\|_\infty = \max_{r=1,\ldots,q} \sqrt{(n^{-1}\mathbf{Z}^T\mathbf{Z})_{rr}}$, and $\xi^2 = \sigma^2 + E[(\mu_X - \mu_Z)^2]$. Then, for
$\lambda \ge 2\lambda_0$ and with probability at least $1 - 2\exp(-t^2/2)$, conditional on $\mathbf{Z}$:

$$\|\mathbf{Z}(\hat\gamma - \gamma^0)\|_2^2/n + \lambda\|\hat\gamma - \gamma^0\|_1 \le 4\lambda^2 s(\gamma^0)/\phi_0^2(\mathbf{Z}),$$

where $s(\gamma^0)$ equals the number of non-zero coefficients in $\gamma^0$ and $\phi_0^2(\mathbf{Z})$ denotes the
compatibility constant of the design matrix $\mathbf{Z}$ (Bühlmann and van de Geer, 2011, (6.4)).
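A proof is given in Section 7. The choice $\lambda = 2\lambda_0$ leads to the convergence rates:

$$\|\mathbf{Z}(\hat\gamma(\lambda) - \gamma^0)\|_2^2/n = O_P\!\left(\frac{s(\gamma^0)\log(q)}{n\,\phi_0^2(\mathbf{Z})}\right), \qquad
\|\hat\gamma(\lambda) - \gamma^0\|_1 = O_P\!\left(\frac{s(\gamma^0)}{\phi_0^2(\mathbf{Z})}\sqrt{\frac{\log(q)}{n}}\right). \tag{14}$$

The second result about the $\ell_1$-norm convergence rate implies a variable screening
property as follows: assume a so-called beta-min condition requiring that

$$\min_{r \in S(\gamma^0)} |\gamma^0_r| \ge C\,\frac{s(\gamma^0)}{\phi_0^2(\mathbf{Z})}\sqrt{\frac{\log(q)}{n}} \tag{15}$$

for a sufficiently large $C > 0$, then, with high probability, $\hat{S}_{Y,Z}(\lambda) \supseteq S(\gamma^0)$, where
$\hat{S}_{Y,Z}(\lambda) = \{r;\ \hat\gamma_r(\lambda) \neq 0\}$ and $S(\gamma^0) = \{r;\ \gamma^0_r \neq 0\}$.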



The results in Proposition 4.2, (14) and (15) describe inference for $\mathbf{Z}\gamma^0$ and for $\gamma^0$. Their
meaning for inferring $\mathbf{X}\beta^0$ and for (groups) of $\beta^0$ is further discussed in Section 4.3
for specific examples, representing $\gamma^0$ in terms of $\beta^0$ and the correlation structure of $X$,
and analyzing the squared bias $B^2 = E[(\mu_X - \mu_Z)^2]$. The compatibility constant $\phi_0^2(\mathbf{Z})$
in Proposition 4.2 is typically (much) better behaved, if $q \ll p$, than the corresponding
constant $\phi_0^2(\mathbf{X})$ of the original design $\mathbf{X}$. Bounds of $\phi_0^2(\mathbf{X})$ and $\phi_0^2(\mathbf{Z})$ in terms of their
population covariance $\Sigma$ and $A\Sigma A^T$, respectively, can be derived from Bühlmann and
van de Geer (2011, Cor. 6.8). Thus, loosely speaking, we have to deal with a trade-off:
the term $\phi_0^2(\mathbf{Z})$, coupled with a $\log(q)$- instead of a $\log(p)$-factor (and also to a certain
extent the sparsity factor $s(\gamma^0)$) are favorable for the dimensionality reduced $\mathbf{Z}$. The
price to pay for this is the bias term $B^2 = E[(\mu_X - \mu_Z)^2]$, discussed further in Section
4.3, which appears in the variance $\xi^2$ entering the definition of $\lambda_0$ in Proposition 4.2 as
well as in the prediction error (13); furthermore, the detection of $\gamma^0$ instead of $\beta^0$ can
be favorable for some cases and not favorable for others, as discussed in Section 4.3.

Finally, note that Proposition 4.2 makes a statement conditional on $\mathbf{Z}$: with high
probability $1 - \alpha$,

$$P\big[\|\mathbf{Z}\hat\gamma - \mathbf{Z}\gamma^0\|_2^2/n \le 4\lambda^2 s(\gamma^0)/\phi_0^2(\mathbf{Z})\ \big|\ \mathbf{Z}\big] \ge 1 - \alpha.$$

Assuming that $\phi_0^2(\mathbf{Z}) \ge \phi_0^2(A, \Sigma)$ is bounded with high probability (Bühlmann and
van de Geer, 2011, Lem. 6.17), we obtain (for a small but different $\alpha$ than above):

$$P\big[\|\mathbf{Z}\hat\gamma - \mathbf{Z}\gamma^0\|_2^2/n \le 4\lambda^2 s(\gamma^0)/\phi_0^2(A, \Sigma)\big] \ge 1 - \alpha.$$

In view of (13), we then have for the prediction error:

$$E[\|\mathbf{X}\beta^0 - \mathbf{Z}\hat\gamma\|_2^2/n] = E[\|\mathbf{Z}\hat\gamma - \mathbf{Z}\gamma^0\|_2^2/n] + E[(\mu_X - \mu_Z)^2], \tag{16}$$

where $E[\|\mathbf{Z}\hat\gamma - \mathbf{Z}\gamma^0\|_2^2/n] = O(\xi\,s(\gamma^0)\sqrt{\log(p)/n})$ when choosing $\lambda = 2\lambda_0$.



4.3 The parameter $\gamma^0$ for cluster representatives

In the sequel, we consider the case where $Z = \bar{X} = (\bar{X}^{(1)}, \ldots, \bar{X}^{(q)})^T$ encodes the
cluster representatives $\bar{X}^{(r)} = |G_r|^{-1}\sum_{j \in G_r} X^{(j)}$. We analyze the coefficient vector $\gamma^0$
and discuss it from the viewpoint of detection. For specific examples, we also quantify
the squared bias term $B^2 = E[(\mu_X - \mu_{\bar{X}})^2]$ which plays mainly a role for prediction.

The coefficient $\gamma^0$ can be exactly described if the cluster representatives are independent.

Proposition 4.3 Consider a random design Gaussian linear model as in (11) where
$\mathrm{Cov}(\bar{X}^{(r)}, \bar{X}^{(\ell)}) = 0$ for all $r \neq \ell$. Then,

$$\gamma^0_r = |G_r| \sum_{j \in G_r} w_j \beta^0_j, \quad r = 1, \ldots, q,$$

where

$$w_j = \frac{\sum_{k \in G_r} \Sigma_{j,k}}{\sum_{\ell \in G_r}\sum_{k \in G_r} \Sigma_{\ell,k}}, \qquad \sum_{j \in G_r} w_j = 1.$$

Moreover:

1. If, for $r \in \{1, \ldots, q\}$, $\Sigma_{j,k} \ge 0$ for all $j, k \in G_r$, then $w_j \ge 0$ $(j \in G_r)$, and $\gamma^0_r/|G_r|$
   is a convex combination of $\beta^0_j$ $(j \in G_r)$. In particular, if $\beta^0_j \ge 0$ for all $j \in G_r$,
   or $\beta^0_j \le 0$ for all $j \in G_r$, then

   $$|\gamma^0_r| \ge |G_r| \min_{j \in G_r} |\beta^0_j|.$$

2. If, for $r \in \{1, \ldots, q\}$, $\Sigma_{j,j} \equiv 1$ for all $j \in G_r$ and $\sum_{k \neq j} \Sigma_{j,k} \equiv \zeta$ for all $j \in G_r$,
   then $w_j \equiv |G_r|^{-1}$ and

   $$\gamma^0_r = \sum_{j \in G_r} \beta^0_j.$$

   A concrete example is where $\Sigma$ has a block-diagonal structure with equi-correlation
   within blocks: $\Sigma_{j,k} \equiv \rho_r$ $(j, k \in G_r,\ j \neq k)$ with $-1/(|G_r| - 1) < \rho_r < 1$ (where
   the lower bound for $\rho_r$ ensures positive definiteness of the block-matrix $\Sigma_{G_r,G_r}$).

A proof is given in Section 7. The assumption of uncorrelatedness across $\{\bar{X}^{(r)};\ r =
1, \ldots, q\}$ is reasonable if we have tight clusters corresponding to blocks of a
block-diagonal structure of $\Sigma$.
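
Statement 2 of Proposition 4.3 can be verified numerically in the block-diagonal equi-correlation example. The sketch below computes $\gamma^0$ as the population regression coefficient of $Y$ on the cluster representatives and compares it with the within-cluster sums of $\beta^0$; the helper names, cluster sizes, correlations and coefficients are hypothetical values chosen for illustration.

```python
import numpy as np

def equicorr_block_cov(sizes, rhos):
    """Block-diagonal covariance with unit variances and equi-correlation per block."""
    p = sum(sizes)
    Sigma = np.zeros((p, p))
    start = 0
    for m, rho in zip(sizes, rhos):
        Sigma[start:start + m, start:start + m] = rho
        start += m
    np.fill_diagonal(Sigma, 1.0)
    return Sigma

def averaging_matrix(sizes):
    """q x p matrix A such that A X gives the cluster representatives."""
    A = np.zeros((len(sizes), sum(sizes)))
    start = 0
    for r, m in enumerate(sizes):
        A[r, start:start + m] = 1.0 / m
        start += m
    return A

sizes, rhos = [3, 2], [0.9, 0.5]
beta = np.array([1.0, 2.0, -0.5, 0.0, 3.0])
Sigma = equicorr_block_cov(sizes, rhos)
A = averaging_matrix(sizes)

# Population regression of Y on Z = A X: gamma^0 = Var(Z)^{-1} Cov(Z, Y).
gamma0 = np.linalg.solve(A @ Sigma @ A.T, A @ Sigma @ beta)
cluster_sums = np.array([beta[:3].sum(), beta[3:].sum()])
```

Here the first cluster contains coefficients of mixed sign, so its representative coefficient 2.5 is smaller than the sum of absolute values 3.5, which illustrates the partial cancellation discussed for group representatives.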

We can immediately see that there are benefits or disadvantages of using the group
representatives in terms of the size of the absolute value $|\gamma^0_r|$: obviously a large value
would make it easier to detect the group $G_r$. Taking a group representative is
advantageous if all the coefficients within a group have the same sign. However, we should be
a bit careful since the size of a regression coefficient should be placed in context to the
standard deviation of the regressor: here, the standardized coefficients are

$$\gamma^0_r \sqrt{\mathrm{Var}(\bar{X}^{(r)})}. \tag{17}$$

For e.g. high positive correlation among variables within a group, $\mathrm{Var}(\bar{X}^{(r)})$ is much
larger than for independent variables: for the equi-correlation scenario in statement 2
of Proposition 4.3 we obtain for the standardized coefficient

$$\gamma^0_r \sqrt{\mathrm{Var}(\bar{X}^{(r)})} = \sum_{j \in G_r} \beta^0_j \sqrt{\rho + |G_r|^{-1}(1 - \rho)} \approx \sum_{j \in G_r} \beta^0_j, \tag{18}$$

where the latter approximation holds if $\rho \approx 1$.

The disadvantages occur if rough or near cancellation among $\beta^0_j$ $(j \in G_r)$ takes place.
This can cause a reduction of the absolute value of $|\gamma^0_r|$ in comparison to $\max_{j \in G_r} |\beta^0_j|$:
again, the scenario in statement 2 of Proposition 4.3 is most clear in the sense that the
sum of $\beta^0_j$ $(j \in G_r)$ is equal to $\gamma^0_r$, and near cancellation would mean that $\sum_{j \in G_r} \beta^0_j \approx 0$.

An extension of Proposition 4.3 can be derived for covering the case where the regressors
$\{\bar{X}^{(r)};\ r = 1, \ldots, q\}$ are only approximately uncorrelated.



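Proposition 4.4 Assume the conditions of Proposition 4.3 but instead of uncorrelatedness
of $\{\bar{X}^{(r)};\ r = 1, \ldots, q\}$ across $r$, we require: for $r \in \{1, \ldots, q\}$,

$$|\mathrm{Cov}(X^{(i)}, X^{(j)} \mid \{\bar{X}^{(\ell)};\ \ell \neq r\})| \le \nu \quad \text{for all } j \in G_r,\ i \notin G_r.$$

Moreover, assume that $\mathrm{Var}(\bar{X}^{(r)} \mid \{\bar{X}^{(\ell)};\ \ell \neq r\}) \ge C > 0$. Then,

$$\gamma^0_r = |G_r| \sum_{j \in G_r} w_j \beta^0_j + \Delta_r, \qquad |\Delta_r| \le \nu\,\|\beta^0\|_1/C.$$

Furthermore, if $\mathrm{Cov}(X^{(i)}, X^{(j)} \mid \{\bar{X}^{(\ell)};\ \ell \neq r\}) \ge 0$ for all $j \in G_r$, $i \notin G_r$, and
$\mathrm{Var}(\bar{X}^{(r)} \mid \{\bar{X}^{(\ell)};\ \ell \neq r\}) \ge C > 0$, then:

$$|\gamma^0_r| \ge |G_r| \min_{j \in G_r} |\beta^0_j| + \Delta_r, \qquad |\Delta_r| \le \nu\,\|\beta^0\|_1/C.$$

A proof is given in Section 7. The assumption that $|\mathrm{Cov}(X^{(i)}, X^{(j)} \mid \{\bar{X}^{(\ell)};\ \ell \neq r\})| \le \nu$
for all $j \in G_r$, $i \notin G_r$ is implied if the variables in $G_r$ and $G_\ell$ $(r \neq \ell)$ are rather
uncorrelated. Furthermore, if we require that $\nu\,\|\beta^0\|_1 \ll |G_r| \min_{j \in G_r} |\beta^0_j|$ and $C \asymp 1$
(which holds if $\Sigma_{j,j} \equiv 1$ for all $j \in G_r$ and if the variables within $G_r$ have high conditional
correlations given $\{\bar{X}^{(\ell)};\ \ell \neq r\}$), then:

$$|\gamma^0_r| \ge |G_r| \min_{j \in G_r} |\beta^0_j|\,(1 + o(1)).$$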

Thus, also under clustering with only moderate independence between the clusters, we
can have beneficial behavior for the representative cluster method. This also implies
that the representative cluster method works if the clustering is only approximately
correct, as shown in Section 4.6.



We discuss in the next subsections two examples in more detail. 



4.3.1 Block structure with equi-correlation 

Consider a partition with groups $G_r$ ($r = 1, \ldots, q$). The population covariance matrix
$\Sigma$ is block-diagonal having $|G_r| \times |G_r|$ block-matrices $\Sigma_{G_r,G_r}$ with
equi-correlation: $\Sigma_{j,j} = 1$ for all $j \in G_r$, and $\Sigma_{j,k} = \rho_r$ for all
$j, k \in G_r$ ($j \neq k$), where $-1/(|G_r| - 1) < \rho_r < 1$ (the lower bound for $\rho_r$
ensures positive definiteness of $\Sigma_{G_r,G_r}$). This is exactly the setting as in
statement 2. of Proposition 4.3.

The parameter $\gamma^0$ equals, see Proposition 4.3:

$$\gamma_r^0 = \sum_{j \in G_r} \beta_j^0.$$

Regarding the bias, observe that

$$\mu_X - \bar{\mu}_X = \sum_{r=1}^{q} (\mu_{X;r} - \bar{\mu}_{X;r}),$$

where $\mu_{X;r} = \sum_{j \in G_r} \beta_j^0 X^{(j)}$ and $\bar{\mu}_{X;r} = \bar{X}^{(r)} \gamma_r^0$, and due
to block-independence of $X$, we have

$$E[(\mu_X - \bar{\mu}_X)^2] = \sum_{r=1}^{q} E[(\mu_{X;r} - \bar{\mu}_{X;r})^2].$$

For each summand, we have $\mu_{X;r} - \bar{\mu}_{X;r} = \sum_{j \in G_r} X^{(j)} (\beta_j^0 - \bar{\beta}_r)$,
where $\bar{\beta}_r = |G_r|^{-1} \sum_{j \in G_r} \beta_j^0$. Thus,

$$E[(\mu_{X;r} - \bar{\mu}_{X;r})^2] = \sum_{j \in G_r} (\beta_j^0 - \bar{\beta}_r)^2 + 2 \rho_r \sum_{j,k \in G_r;\, j < k} (\beta_j^0 - \bar{\beta}_r)(\beta_k^0 - \bar{\beta}_r) = (1 - \rho_r) \sum_{j \in G_r} (\beta_j^0 - \bar{\beta}_r)^2,$$

where the last equality uses $\sum_{j \in G_r} (\beta_j^0 - \bar{\beta}_r) = 0$. Therefore, the squared
bias equals

$$B^2 = E[(\mu_X - \bar{\mu}_X)^2] = \sum_{r=1}^{q} (1 - \rho_r) \sum_{j \in G_r} (\beta_j^0 - \bar{\beta}_r)^2. \tag{19}$$

We see from the formula that the bias is small if there is little variation of $\beta^0$ within
the groups $G_r$ or/and the $\rho_r$'s are close to one. The latter is what we obtain with
tight clusters: there is a large within and small between groups correlation. Somewhat
surprising is the fact that the bias becomes large if $\rho_r$ tends to negative values (which
is related to the fact that detection becomes bad for negative values of $\rho_r$, see (18)).

Thus, in summary: in comparison to an estimator based on $X$, using the cluster
representatives $\bar{X}$ and subsequent estimation leads to equal (or better, due to smaller
dimension) prediction error if all $\rho_r$'s are close to 1, regardless of $\beta^0$. When using
the cluster representative Lasso, if
$B^2 = \sum_{r=1}^{q} (1 - \rho_r) \sum_{j \in G_r} (\beta_j^0 - \bar{\beta}_r)^2 = O(s(\gamma^0) \log(q)/n)$,
then the squared bias has no disturbing effect on the prediction error as can be seen
from (14).
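The closed form (19) can be compared with a direct evaluation of the quadratic form within each equi-correlation block. This is a minimal sketch under the block model of this subsection; the function names and the toy coefficients in the usage note are illustrative assumptions.

```python
import numpy as np

def squared_bias(beta, groups, rhos):
    """Closed form (19): sum_r (1 - rho_r) * sum_{j in G_r} (beta_j - mean_r)^2."""
    total = 0.0
    for idx, rho in zip(groups, rhos):
        b = beta[idx]
        total += (1.0 - rho) * np.sum((b - b.mean()) ** 2)
    return float(total)

def squared_bias_direct(beta, groups, rhos):
    """Same quantity from d' Sigma_block d with d = beta_group - group mean."""
    total = 0.0
    for idx, rho in zip(groups, rhos):
        m = len(idx)
        sigma = (1.0 - rho) * np.eye(m) + rho * np.ones((m, m))
        d = beta[idx] - beta[idx].mean()
        total += d @ sigma @ d
    return float(total)
```

For example, with `beta = np.array([1., 1., 1., 2., -2., 0.])`, groups `[np.arange(0, 3), np.arange(3, 6)]` and correlations `[0.9, 0.5]`, the first group contributes nothing (no variation of the coefficients) and the second contributes `0.5 * 8 = 4`.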



With respect to detection, there can be a substantial gain for inferring the cluster $G_r$
if $\beta_j^0$ ($j \in G_r$) have all the same sign, and if $\rho_r$ is close to 1. Consider the
active groups

$$S_{0,\mathrm{Group}} = \{r;\ \beta^0_{G_r} \neq 0,\ r = 1, \ldots, q\}.$$

For the current model and assuming that $\sum_{j \in G_r} \beta_j^0 \neq 0$ for $r \in S_{0,\mathrm{Group}}$
(i.e., no exact cancellation of coefficients within groups),
$S_{0,\mathrm{Group}} = S(\gamma^0) = \{r;\ \gamma_r^0 \neq 0\}$. In view of (18) and (15), the screening
property for groups holds if

$$\min_{r \in S_{0,\mathrm{Group}}} \Big| \sum_{j \in G_r} \beta_j^0 \Big| \sqrt{\rho_r + |G_r|^{-1}(1 - \rho_r)} \ge C \frac{s(\gamma^0)}{\phi_0^2(\bar{X})} \sqrt{\frac{\log(q)}{n}}.$$

This condition holds even if the non-zero $\beta_j^0$'s are very small but their sum within a
group is sufficiently strong.



4.3.2 One active variable per cluster 

We design here a model with at most one active variable in each group. Consider a
low-dimensional $q \times 1$ variable $U$ and perturbed versions of $U^{(r)}$ ($r = 1, \ldots, q$)
which constitute the $p \times 1$ variable $X$:

$$X^{(r,1)} = U^{(r)}, \quad r = 1, \ldots, q,$$
$$X^{(r,j)} = U^{(r)} + \delta^{(r,j)}, \quad j = 2, \ldots, m_r,\ r = 1, \ldots, q,$$
$$\delta^{(r,2)}, \ldots, \delta^{(r,m_r)} \ \text{i.i.d.}\ \mathcal{N}(0, \tau_r^2) \ \text{and independent among } r = 1, \ldots, q. \tag{20}$$

The index $j = 1$ has no specific meaning, and the fact that one covariate is not perturbed
is discussed in Remark 2 below. The purpose is to have at most one active variable in
every cluster by assuming a low-dimensional underlying linear model

$$Y = \sum_{r=1}^{q} \tilde{\beta}_r^0 U^{(r)} + \varepsilon, \tag{21}$$

where $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ is independent of $U$. Some of the coefficients in
$\tilde{\beta}^0$ might be zero, and hence, some of the $U^{(r)}$'s might be noise covariates. We
construct a $p \times 1$ variable $X$ by stacking the variables $X^{(r,j)}$ as follows:
$X^{(\sum_{\ell=1}^{r-1} m_\ell + j)} = X^{(r,j)}$ ($j = 1, \ldots, m_r$) for $r = 1, \ldots, q$. Furthermore,
we use an augmented vector of the true regression coefficients

$$\beta_j^0 = \begin{cases} \tilde{\beta}_r^0 & \text{if } j = \sum_{\ell=1}^{r-1} m_\ell + 1,\ r = 1, \ldots, q, \\ 0 & \text{otherwise.} \end{cases}$$

Thus, the model in (21) can be represented as

$$Y = \sum_{j=1}^{p} \beta_j^0 X^{(j)} + \varepsilon,$$

where $\varepsilon$ is independent of $X$.



Remark 2. Instead of the model in (20)-(21), we could consider

$$X^{(r,j)} = U^{(r)} + \delta^{(r,j)}, \quad j = 1, \ldots, m_r,\ r = 1, \ldots, q,$$
$$\delta^{(r,1)}, \ldots, \delta^{(r,m_r)} \ \text{i.i.d.}\ \mathcal{N}(0, \tau_r^2) \ \text{and independent among } r = 1, \ldots, q,$$

and

$$Y = \sum_{j=1}^{p} \beta_j^0 X^{(j)} + \varepsilon,$$

where $\varepsilon$ is independent of $X$. Stacking the variables $X^{(r,j)}$ as before, the
covariance matrix $\mathrm{Cov}(X)$ has $q$ equi-correlation blocks but the blocks are generally
dependent if $U^{(1)}, \ldots, U^{(q)}$ are correlated, i.e., $\mathrm{Cov}(X)$ is not of block-diagonal
form. If $\mathrm{Cov}(U^{(r)}, U^{(\ell)}) = 0$ ($r \neq \ell$), we are back to the model in Section 4.3.1.
For more general covariance structures of $U$, an analysis of the model seems rather
cumbersome (but see Proposition 4.4) while the analysis of the "asymmetric" model
(20)-(21) remains rather simple as discussed below.

We assume that the clusters $G_r$ are corresponding to the variables
$\{X^{(r,j)};\ j = 1, \ldots, m_r\}$ and thus, $m_r = |G_r|$. In contrast to the model in Section
4.3.1, we do not assume uncorrelatedness or quantify correlation between clusters. The
cluster representatives are

$$\bar{X}^{(r)} = m_r^{-1} \sum_{j \in G_r} X^{(r,j)} = U^{(r)} + W^{(r)},$$
$$W^{(r)} \sim \mathcal{N}\Big(0, \tau_r^2 \frac{m_r - 1}{m_r^2}\Big) \ \text{and independent among } r = 1, \ldots, q.$$

As in (12), the dimension-reduced model is written as $Y = \sum_{r=1}^{q} \gamma_r^0 \bar{X}^{(r)} + \eta$.
For the bias, we immediately find:

$$B^2 = E[(\mu_X - \bar{X}^T \gamma^0)^2] \le E[(U^T \tilde{\beta}^0 - \bar{X}^T \tilde{\beta}^0)^2] = E|W^T \tilde{\beta}^0|^2 \le s_0 \max_j |\beta_j^0|^2 \max_r \frac{m_r - 1}{m_r^2} \tau_r^2. \tag{22}$$

Thus, if the cluster sizes $m_r$ are large and/or the perturbation noise $\tau_r^2$ is small, the
squared bias $B^2$ is small.
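To make the construction in (20) and the bound in (22) concrete, here is a minimal simulation sketch. It assumes $U \sim \mathcal{N}_q(0, I)$ and a common perturbation level `tau` for all clusters; both choices, and all function names, are illustrative assumptions rather than part of the model definition.

```python
import numpy as np

def simulate_design(n, sizes, tau, rng):
    """Draw n rows of X as in (20), with U ~ N_q(0, I) as an assumed choice."""
    q = len(sizes)
    U = rng.standard_normal((n, q))
    blocks = []
    for r, m in enumerate(sizes):
        delta = tau * rng.standard_normal((n, m))
        delta[:, 0] = 0.0  # X^(r,1) = U^(r) is left unperturbed
        blocks.append(U[:, [r]] + delta)
    return U, np.hstack(blocks)

def cluster_means(X, sizes):
    """Cluster representatives: mean over the columns of each consecutive block."""
    edges = np.cumsum([0] + list(sizes))
    return np.column_stack([X[:, a:b].mean(axis=1) for a, b in zip(edges[:-1], edges[1:])])

def bias_bound(beta_tilde, sizes, tau):
    """Right-hand side of (22): s0 * max beta^2 * max_r (m_r - 1) tau^2 / m_r^2."""
    s0 = int(np.count_nonzero(beta_tilde))
    var_w = max((m - 1) * tau**2 / m**2 for m in sizes)
    return s0 * float(np.max(np.abs(beta_tilde)) ** 2) * var_w
```

The bound shrinks roughly like `tau**2 / m` as the cluster size `m` grows, which is the mechanism behind the remark that large or tight clusters give small squared bias.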

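Regarding detection, we make use of the following result.

Proposition 4.5 For the model in (21) we have:

$$\|\tilde{\beta}^0 - \gamma^0\|_2^2 \le 2 B^2 / \lambda_{\min}^2(\mathrm{Cov}(U)) \le 2 E|W^T \tilde{\beta}^0|^2 / \lambda_{\min}^2(\mathrm{Cov}(U)) \le 2 s_0 \max_j |\beta_j^0|^2 \max_r \frac{m_r - 1}{m_r^2} \tau_r^2 / \lambda_{\min}^2(\mathrm{Cov}(U)),$$

where $\lambda_{\min}^2(\mathrm{Cov}(U))$ denotes the minimal eigenvalue of $\mathrm{Cov}(U)$.

A proof is given in Section [?]. Denote by $\tilde{S}_0 = \{r;\ \tilde{\beta}_r^0 \neq 0\}$. We then have:

$$\text{if } \min_{r \in \tilde{S}_0} |\tilde{\beta}_r^0| > 2 \sqrt{2 s_0 \max_j |\beta_j^0|^2 \max_r \frac{m_r - 1}{m_r^2} \tau_r^2 / \lambda_{\min}^2(\mathrm{Cov}(U))},$$
$$\text{then: } \min_{r \in \tilde{S}_0} |\gamma_r^0| > \sqrt{2 s_0 \max_j |\beta_j^0|^2 \max_r \frac{m_r - 1}{m_r^2} \tau_r^2 / \lambda_{\min}^2(\mathrm{Cov}(U))}. \tag{23}$$

This follows immediately: if the implication would not hold, it would create a
contradiction to Proposition 4.5.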
We argue now that

$$\max_r \frac{\tau_r^2}{m_r} = O(\log(q)/n) \tag{24}$$

is a sufficient condition to achieve prediction error and detection as in the
$q$-dimensional model (21). Thereby, we implicitly assume that
$\max_j |\beta_j^0| \le C < \infty$ and $\lambda_{\min}^2(\mathrm{Cov}(U)) \ge L > 0$. Since we have that
$s(\gamma^0) \ge s_0$ (excluding the pathological case for a particular combination of $\beta^0$
and $\Sigma = \mathrm{Cov}(X)$), using (22) the condition (24) implies

$$B^2 = O(s_0 \log(q)/n) \le O(s(\gamma^0) \log(q)/n),$$

which is at most of the order of the prediction error in (14), where $\phi_0^2(\bar{X})$ can
be lower-bounded by the population version $\phi_0^2(\mathrm{Cov}(\bar{X})) \ge \phi_0^2(\mathrm{Cov}(U))$
(Bühlmann and van de Geer, 2011, Cor. 6.8). For detection, we note that (24) implies that
the bound in (23) is at most

$$\min_{r \in \tilde{S}_0} |\gamma_r^0| > \sqrt{2 s_0 \max_j |\beta_j^0|^2 \max_r \frac{m_r - 1}{m_r^2} \tau_r^2 / \lambda_{\min}^2(\mathrm{Cov}(U))} \le O\big(s(\gamma^0) \sqrt{\log(q)/n}\big).$$

The right-hand side is what we require as beta-min condition in (15) for group screening
such that with high probability $\hat{S} \supseteq S(\gamma^0) \supseteq \tilde{S}_0$ (again excluding a
particular constellation of $\beta^0$ and $\Sigma$).

The condition (24) itself is fulfilled if $m_r \asymp n/\log(q)$, i.e., when the cluster sizes
are large, or if $\tau_r^2 = O(\log(q)/n)$, i.e., the clusters are tight. An example where the
model (20)-(21) with (24) seems reasonable is for genome-wide association studies with SNPs
where $p \approx 10^6$, $n \approx 1000$ and $m_r$ can be in the order of $10^3$ and hence
$q \approx 1000$ when e.g. using the order of magnitude of number of target SNPs (Carlson et
al., 2004). Note that this is a scenario where the group sizes $m_r \gg n$ where the cluster
group Lasso seems inappropriate.

The analysis in Sections 4.2 and 4.3 about the bias $B^2$ and the parameter $\gamma^0$ has
immediate implications for the cluster representative Lasso (CRL), as discussed in the
next section.



4.4 A comparison 

We compare now the results of the cluster representative Lasso (CRL), the cluster group
Lasso (CGL) and the plain Lasso, at least on a "rough scale".

For the plain Lasso we have: with high probability,

$$\|X(\hat{\beta}_{\mathrm{Lasso}} - \beta^0)\|_2^2 / n = O\Big(\frac{\log(p)\, s_0}{n\, \phi_0^2(X)}\Big), \quad \|\hat{\beta}_{\mathrm{Lasso}} - \beta^0\|_1 = O\Big(\frac{s_0}{\phi_0^2(X)} \sqrt{\frac{\log(p)}{n}}\Big), \tag{25}$$

which both involve $\log(p)$ instead of $\log(q)$ and, more importantly, the compatibility
constant $\phi_0^2(X)$ of the design matrix $X$ instead of $\phi_0^2(\bar{X})$ of the matrix $\bar{X}$.
If $p$ is large, then $\phi_0^2(X)$ might be close to zero; furthermore, it is exactly in
situations like model (20) and (21), having a few ($s_0 \le q$) active variables and noise
covariates being highly correlated with the active variables, which leads to very small
values of $\phi_0^2(X)$, see van de Geer and Lederer (2011). For variable screening
$\hat{S}_{\mathrm{Lasso}}(\lambda) \supseteq S_0$ with high probability, the corresponding (sufficient)
beta-min condition is

$$\min_{j \in S_0} |\beta_j^0| \ge C s_0 \sqrt{\frac{\log(p)}{n}} \Big/ \phi_0^2(X) \tag{26}$$

for a sufficiently large $C > 0$.

For comparison with the cluster group Lasso (CGL) method, we assume for simplicity
equal group sizes $|G_r| = m_r \equiv m$ for all $r$ and $\log(q)/m \le 1$, i.e., the group size
is sufficiently large. We then obtain: with high probability,

$$\|X(\hat{\beta}_{\mathrm{CGL}} - \beta^0)\|_2^2 / n = O\Big(\frac{|S_{0,\mathrm{Group}}|\, m}{n\, \phi_{0,\mathrm{Group}}^2(X)}\Big), \quad \sum_{r=1}^{q} \|\hat{\beta}_{G_r} - \beta^0_{G_r}\|_2 = O\Big(\frac{|S_{0,\mathrm{Group}}| \sqrt{m}}{\sqrt{n}\, \phi_{0,\mathrm{Group}}^2(X)}\Big). \tag{27}$$

For variable screening $\hat{S}_{\mathrm{CGL}}(\lambda) \supseteq S_0$ with high probability, the corresponding
(sufficient) beta-min condition is

$$\min_{r \in S_{0,\mathrm{Group}}} \|\beta^0_{G_r}\|_2 \ge C \frac{|S_{0,\mathrm{Group}}| \sqrt{m}}{\sqrt{n}\, \phi_{0,\mathrm{Group}}^2(X)}$$

for a sufficiently large $C > 0$. The compatibility constants $\phi_0^2(X)$ in (25) and
$\phi_{0,\mathrm{Group}}^2(X)$ in (27) are not directly comparable, but see Theorem 4.1 which is in
favor of the CGL method. "On the rough scale", we can distinguish two cases: if the
group sizes are large with none or only a few active variables per group, implying
$s_0 = |S_0| \approx |S_{0,\mathrm{Group}}|$, the Lasso is better than the CGL method because the CGL
rate involves $|S_{0,\mathrm{Group}}|\, m$ or $|S_{0,\mathrm{Group}}| \sqrt{m}$, respectively, instead of the
sparsity $s_0$ appearing in the rate for the standard Lasso; for the case where we have
either none or many active variables within groups, the CGL method is beneficial, mainly
for detection, since $|S_{0,\mathrm{Group}}|\, m \approx |S_0| = s_0$ but
$|S_{0,\mathrm{Group}}| \sqrt{m} < |S_0| = s_0$. The behavior in the first case is to be expected since
in the group Lasso representation (9), the parameter vectors are very sparse within
groups, and this sparsity is not exploited by the group Lasso. A sparse group Lasso
method (Friedman et al., 2010a) would address this issue. On the other hand, the CGL
method has the advantage that it works without bias, in contrast to the CRL procedure.
Furthermore, the CGL can also lead to good detection if many $\beta_j^0$'s in a group are small
in absolute value. For detection of the group $G_r$, we only need that $\|\beta^0_{G_r}\|_2$ is
sufficiently large: the signs of the coefficients of $\beta^0_{G_r}$ can be different and (near)
cancellation does not happen.
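The two cases distinguished "on the rough scale" can be illustrated by comparing the orders of the prediction rates numerically. This is only a sketch: compatibility constants and multiplicative constants are dropped, and the function names and example numbers are illustrative assumptions.

```python
import math

def lasso_rate(s0, p, n):
    """Order of the plain Lasso prediction rate, s0 * log(p) / n (constants dropped)."""
    return s0 * math.log(p) / n

def cgl_rate(n_active_groups, m, n):
    """Order of the cluster group Lasso prediction rate, |S_0,Group| * m / n (constants dropped)."""
    return n_active_groups * m / n
```

For instance, with `p = 1000`, `n = 100` and three active groups of size `m = 200`, one active variable per group gives `lasso_rate(3, 1000, 100)` of about 0.21 against `cgl_rate(3, 200, 100) = 6.0`, whereas fully active groups (`s0 = 600`) reverse the ordering.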

For the cluster representative Lasso (CRL) method, the range of scenarios with good
performance of CRL is more restricted. The method works well and is superior over
the plain Lasso (and group Lasso) if the bias $B^2$ is small and the detection is well-posed
in terms of the dimension-reduced parameter $\gamma^0$. More precisely, if

$$B^2 = O\Big(\frac{s(\gamma^0) \log(q)}{n\, \phi_0^2(\bar{X})}\Big),$$

the CRL is better than plain Lasso for prediction since the corresponding oracle
inequality for the CRL becomes, see (16): with high probability,

$$\|\bar{X} \hat{\gamma} - X \beta^0\|_2^2 / n + \lambda \|\hat{\gamma} - \gamma^0\|_1 \le O\Big(\frac{s(\gamma^0) \log(q)}{n\, \phi_0^2(\bar{X})}\Big).$$

We have given two examples and conditions ensuring that the bias is small, namely (19)
and (24). The latter condition (24) is also sufficient for better screening property in the
model from Section 4.3.2. For the equi-correlation model in Section 4.3.1 the success of
screening crucially depends on whether the coefficients from active variables in a group
nearly cancel or add up (e.g. when having the same sign), see Propositions 4.3 and 4.4.
The following Table 1 recapitulates the findings.

model                                  assumption                                     predict.   screen.
equi-corr. blocks (Sec. 4.3.1)         small value in (19)                            +          NA
equi-corr. blocks (Sec. 4.3.1)         e.g. same sign for $\beta_j^0$ ($j \in G_r$)   NA         +
$\le 1$ var. per group (Sec. 4.3.2)    (24)                                           +          +

Table 1: Comparison of cluster representative Lasso (CRL) with plain Lasso in terms
of prediction and variable screening. The symbol "+" encodes better theoretical results
for the CRL in comparison to Lasso; an "NA" means that no comparative statement
can be made.

Summarizing, both the CGL and CRL are useful and can be substantially better than
plain Lasso in terms of prediction and detection in presence of highly correlated
variables. If the cluster sizes are smaller than sample size, the CGL method is more
broadly applicable, in the sense of consistency but not necessarily efficiency, as it does
not involve the bias term $B^2$, and the constellation of signs or near cancellation of
coefficients in $\beta^0_{G_r}$ is not an issue. For group sizes which are larger than sample size,
the CGL is not appropriate: one would need to take a version of the group Lasso with
regularization within groups (Meier et al., 2009; Friedman et al., 2010a). The CGL method
benefits when using canonical correlation based clustering as this improves the
compatibility constant, see Theorem 4.1. The CRL method is particularly suited for
problems where the variables can be grouped into tight clusters and/or the cluster sizes
are large. There is gain if there is at most one active variable per cluster and the
clusters are tight; otherwise the prediction performance is influenced by the bias $B^2$
and detection depends on whether the coefficients within a group add up or exhibit near
cancellation. If the variables are not very highly correlated within large groups, the
difficulty is to estimate these groups, and in case of correct grouping, as assumed in the
theoretical results above, the CRL method may still perform (much) better than plain Lasso.



4.5 Estimation of the clusters 

The theoretical derivations above assume that the groups $G_r$ ($r = 1, \ldots, q$) correspond
to the correct clusters. For the canonical correlation based clustering as in Algorithm 1,
Theorem 2.2 discusses consistency in finding the true underlying population clustering.
For hierarchical clustering, the issue is much simpler.

Consider the $n \times p$ design matrix $X$ as in (3) and assume for simplicity that
$\Sigma_{j,j} = 1$ for all $j$. It is well-known that

$$\max_{j,k} |\hat{\Sigma}_{j,k} - \Sigma_{j,k}| = O_P\big(\sqrt{\log(p)/n}\big), \tag{28}$$

where $\hat{\Sigma}$ is the empirical covariance matrix (Bühlmann and van de Geer, 2011, cf. p. 152).
Tightness and separation of the true clusters is ensured by:

$$\min\{|\Sigma_{j,k}|;\ j, k \in G_r\ (j \neq k),\ r = 1, \ldots, q\} > \max\{|\Sigma_{j,k}|;\ j \in G_r,\ k \in G_\ell,\ r, \ell = 1, \ldots, q\ (r \neq \ell)\}. \tag{29}$$

Assuming (29) and using (28), a standard clustering algorithm, using e.g. single-linkage
and dissimilarity $1 - |\hat{\Sigma}_{j,k}|$ between variables $X^{(j)}$ and $X^{(k)}$, will consistently
find the true clusters if $\log(p)/n \to 0$.

In summary, and rather obvious: the higher the correlation within and uncorrelatedness
between clusters, the better we can estimate the true underlying grouping. In this sense,
and following the arguments in Sections 4.1-4.4, strong correlation within clusters "is a
friend" when using cluster Lasso methods, while it is "an enemy" (at least for variable
screening and selection) for plain Lasso.
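The single-linkage argument can be sketched in a few lines. Cutting a single-linkage dendrogram at a given height yields the connected components of the graph that links two variables whenever their dissimilarity is below that height, so the sketch below computes those components directly. The function name and the choice of `threshold` are illustrative assumptions; in practice the cut would be chosen by one of the rules for the number of clusters.

```python
import numpy as np

def single_linkage_clusters(X, threshold):
    """Cluster the columns of X with dissimilarity 1 - |empirical correlation|.

    Returns labels of the connected components of the graph with an edge (j, k)
    whenever the dissimilarity is below `threshold` (single-linkage cut).
    """
    corr = np.corrcoef(X, rowvar=False)
    adjacent = (1.0 - np.abs(corr)) < threshold
    p = corr.shape[0]
    labels = -np.ones(p, dtype=int)
    current = 0
    for start in range(p):
        if labels[start] >= 0:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first search over one component
            j = stack.pop()
            for k in np.flatnonzero(adjacent[j]):
                if labels[k] < 0:
                    labels[k] = current
                    stack.append(k)
        current += 1
    return labels
```

When the within-cluster correlations are well separated from the between-cluster ones, as in (29), any threshold between the two levels recovers the true partition once the empirical correlations are accurate enough.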



4.6 Some first illustrations 

We briefly illustrate some of the points and findings mentioned above for the CRL
and the plain Lasso. Throughout this subsection, we show the results from a single
realization of each of different models. More systematic simulations are shown in Section
5. We analyze scenarios with $p = 1000$ and $n = 100$. Thereby, the covariates are
generated as in model (20) where $U \sim \mathcal{N}_q(0, I)$ with $q = 5$ and $\tau = 0.5$, and thus,
$\mathrm{Cov}(X) = \Sigma$ is of block-diagonal structure. The response is as in the linear model (1)
with $\varepsilon \sim \mathcal{N}_n(0, I)$. We consider the following.

Correct and incorrect clustering. The correct clustering consists of $q = 5$ clusters each
having $m_r = |G_r| = 200$ variables, corresponding to the structure in model (20). An
incorrect clustering was constructed as 5 clusters where the first half (100) of the
variables in each constructed cluster correspond to the first half (100) of the variables in
each of the true 5 clusters, and the remaining second half (100) of the variables in the
constructed clusters are chosen randomly from the total of 500 remaining variables. We
note that $m_r = 200 > n = 100$ and thus, the CGL method is inappropriate (see e.g.
Proposition 4.1).
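The construction of the incorrect clustering can be sketched as follows. This is one possible reading of the description, assuming that the 500 remaining variables are randomly redistributed in equal parts of 100 over the 5 constructed clusters; the function name and the 0-based indexing are illustrative.

```python
import numpy as np

def incorrect_clustering(q=5, m=200, rng=None):
    """Return (true_labels, perturbed_labels) for q clusters of size m.

    The first half of each true cluster keeps its label; the second halves of all
    clusters are pooled and randomly redistributed in equal parts.
    """
    rng = np.random.default_rng() if rng is None else rng
    half = m // 2
    true_labels = np.repeat(np.arange(q), m)
    labels = true_labels.copy()
    # pool the second half of every true cluster (the "remaining" variables)
    rest = np.concatenate([np.arange(r * m + half, (r + 1) * m) for r in range(q)])
    shuffled = rng.permutation(rest)
    part = m - half
    for r in range(q):
        labels[shuffled[r * part:(r + 1) * part]] = r
    return true_labels, labels
```

Every constructed cluster again has 200 variables, of which 100 are guaranteed to come from the matching true cluster.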



Active variables and regression coefficients. We always consider 3 active groups (a group
is called active if there is at least one active variable in the group). The scenarios are
as follows:

(a) One active variable within each of 3 active groups, namely $S_0 = \{1, 201, 401\}$.
The regression coefficients are $\beta_1^0 = -1$, $\beta_{201}^0 = -1$, $\beta_{401}^0 = 1$;

(b) 4 active variables within each of 3 active groups, namely $S_0 = \{1, 2, 3, 4, 201, 202,
203, 204, 401, 402, 403, 404\}$. The regression coefficients are $\beta_j^0 = 0.25$ for $j \in S_0$;

(c) as in (b) but with regression coefficients $\{\beta_j^0;\ j \in S_0\}$ i.i.d. $\sim \mathrm{Unif}([-0.5, 0.5])$;

(d) as in (b) but with exact cancellation of coefficients: $\beta_1^0 = \beta_3^0 = 2$,
$\beta_2^0 = \beta_4^0 = -2$, $\beta_{201}^0 = \beta_{203}^0 = 2$, $\beta_{202}^0 = \beta_{204}^0 = -2$,
$\beta_{401}^0 = \beta_{403}^0 = 2$, $\beta_{402}^0 = \beta_{404}^0 = -2$.

For the scenario in (d), we had to choose large coefficients, in absolute value equal to
2, in order to see clear differences, in favor of the plain Lasso. Figure 2, using the
R-package glmnet (Friedman et al., 2010b), shows the results.



The results seem rather robust against approximate cancellation of coefficients
(Subfigure 2(c)) and incorrect clustering (right panels in Figure 2). Regarding the latter,
the number of chosen clusters is worse than for correct clustering, though. A main message
of the results in Figure 2 is that the predictive performance (using cross-validation)
is a good indicator whether the group representative Lasso (with correct or incorrect
clustering) works.

We can complement the rules from Section 2 for determining the number of clusters
as follows: take the representative cluster Lasso method with the largest clusters (the
least refined partition of $\{1, \ldots, p\}$) such that predictive performance is still reasonable
(in comparison to the best achieved performance where we would always consider plain
Lasso among the competitors as well). In the extreme case of Subfigure 2(d), this rule
would choose the plain Lasso (among the alternatives of correct clustering and incorrect
clustering) which is indeed the least refined partition such that predictive performance
is still reasonable.
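The rule can be written down as a small selection step. The sketch below is an illustration only: what counts as "still reasonable" is not quantified in the text, so the relative `tolerance` is an assumption, as are the function name and the representation of each candidate as a pair (number of clusters, cross-validated error), with plain Lasso entering as the candidate with `p` clusters.

```python
def coarsest_reasonable(candidates, tolerance):
    """Pick the candidate with the fewest clusters whose CV error is within a
    relative `tolerance` of the best CV error.

    `candidates` is a list of (n_clusters, cv_error) pairs.
    """
    best = min(err for _, err in candidates)
    ok = [(k, err) for k, err in candidates if err <= (1.0 + tolerance) * best]
    return min(ok, key=lambda c: c[0])
```

With candidates `[(5, 1.30), (50, 1.05), (1000, 1.00)]` and a tolerance of 10%, the partition with 50 clusters is chosen; if every clustering is far worse than plain Lasso, as with `[(5, 9.0), (1000, 1.0)]`, the rule falls back to plain Lasso.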



5 Numerical results 

In this section we look at three different simulation settings and a pseudo real data 
example in order to empirically compare the proposed cluster Lasso methods with plain 
Lasso. 



5.1 Simulated data 

Here, we only report results for the CRL and CGL methods where the clustering of
the variables is based on canonical correlations using Algorithm 1 (see Section 2.1).
The corresponding results using ordinary hierarchical clustering, based on correlations
and with average-linkage (see Section 2.2), are almost exactly the same because for the
considered simulation settings, both clustering methods produce essentially the same
partition.

Figure 2: Left, middle and right panel of each subfigure: Lasso, cluster representative
Lasso with correct clustering, and cluster representative Lasso with incorrect clustering
as described in Section 4.6. 10-fold CV squared error (y-axis) versus $\log(\lambda)$ (x-axis).
Grey bars indicate the region 10-fold CV squared error +/- estimated standard error
(s.e.) of 10-fold CV squared error. The left vertical bar indicates the minimizer of the
CV error and the right vertical bar corresponds to the largest value such that the CV
error is within one standard deviation (s.d.) of the minimum. The numbers on top of
each plot report the number of selected variables (in case of the cluster representative
Lasso, the number of selected representatives): the number of active groups is always
equal to 3, and the number of active variables is 3 for (a) and 12 for (b)-(d). Subfigures
(a)-(d) correspond to the scenarios (a)-(d) in Section 4.6.



We simulate data from the linear model in (1) with fixed design $X$,
$\varepsilon \sim \mathcal{N}_n(0, \sigma^2 I)$ with $n = 100$ and $p = 1000$. We generate the fixed design
matrix $X$ once, from a multivariate normal distribution $\mathcal{N}_p(0, \Sigma)$ with different
structures for $\Sigma$, and we then keep it fixed. We consider various scenarios, but the
sparsity or size of the active set is always equal to $s_0 = 20$.

In order to compare the various methods, we look at two performance measures for
prediction and variable screening. For each model, our simulated data consists of a
training and an independent test set. The models were fitted on the training data
and we computed the test set mean squared error
$n^{-1} \sum_{i=1}^{n} E_Y[(Y_{\mathrm{test}}(X_i) - \hat{Y}(X_i))^2]$, denoted by MSE. For variable screening
we consider the true positive rate as a measure of performance, i.e.,
$|\hat{S} \cap S_0| / |S_0|$ as a function of $|\hat{S}|$.

For each of the methods we choose a suitable grid of values for the tuning parameter.
All reported results are based on 50 simulation runs.
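The screening measure is simple to compute. The following minimal sketch evaluates the true positive rate for a single selected set and along an ordered path of selected variables; the function names and the idea of passing a ranking (e.g. the order in which variables enter along the regularization path) are illustrative assumptions.

```python
def true_positive_rate(selected, active):
    """|S_hat intersected with S_0| / |S_0| for index collections `selected` and `active`."""
    active = set(active)
    if not active:
        raise ValueError("the active set S_0 must be non-empty")
    return len(active & set(selected)) / len(active)

def tpr_curve(ranked, active):
    """True positive rate as a function of |S_hat| when variables enter in the order `ranked`."""
    return [true_positive_rate(ranked[:k], active) for k in range(1, len(ranked) + 1)]
```

For example, `true_positive_rate([1, 2, 7], [1, 2, 3, 4])` gives 0.5, since two of the four active variables are selected.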



5.1.1 Block diagonal model 

Consider a block diagonal model where we simulate the covariables $X \sim \mathcal{N}_p(0, \Sigma_A)$
where $\Sigma_A$ is a block diagonal matrix. We use a $10 \times 10$ matrix $\Gamma$ with unit
diagonal, $\Gamma_{j,j} = 1$, and high correlation between distinct variables of a block.

The block-diagonal of $\Sigma_A$ consists of 100 such block matrices $\Gamma$. Regarding the
regression parameter $\beta^0$, we consider the following configurations:

(Aa) $S_0 = \{1, 2, \ldots, 20\}$ and for any $j \in S_0$ we sample $\beta_j^0$ from the set
$\{2/s_0, 4/s_0, \ldots, 2\}$ without replacement (anew in each simulation run).

(Ab) $S_0 = \{1, 2, 11, 12, 21, 22, \ldots, 91, 92\}$ and for any $j \in S_0$ we sample $\beta_j^0$ from
the set $\{2/s_0, 4/s_0, \ldots, 2\}$ without replacement (anew in each simulation run).

(Ac) $\beta^0$ as in (Aa) but we switch the sign of half and randomly chosen active parameters
(anew in each simulation run).

(Ad) $\beta^0$ as in (Ab) but we switch the sign of half and randomly chosen active parameters
(anew in each simulation run).
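The configurations above can be generated along the following lines. This is a sketch under stated assumptions: the within-block correlation `rho` is a free parameter here (a value such as 0.9, as used in the other two simulation models, is only a guess for this model), indices are 0-based rather than 1-based, and the function names are illustrative.

```python
import numpy as np

def block_diag_cov(n_blocks, block_size, rho):
    """Block-diagonal covariance with equi-correlated, unit-variance blocks."""
    block = (1.0 - rho) * np.eye(block_size) + rho * np.ones((block_size, block_size))
    return np.kron(np.eye(n_blocks), block)

def sample_beta(p, active, rng, flip_half=False):
    """Coefficients on `active` drawn without replacement from {2/s0, 4/s0, ..., 2}."""
    s0 = len(active)
    values = rng.permutation(2.0 * np.arange(1, s0 + 1) / s0)
    if flip_half:  # sign switches as in configurations (Ac) and (Ad)
        flip = rng.choice(s0, size=s0 // 2, replace=False)
        values[flip] *= -1.0
    beta = np.zeros(p)
    beta[np.asarray(active)] = values
    return beta
```

A design could then be drawn once with `rng.multivariate_normal(np.zeros(p), block_diag_cov(100, 10, rho), size=n)` and kept fixed across simulation runs, while `sample_beta` is called anew in each run.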

The set-up (Aa) has all the active variables in the first two blocks of highly correlated
variables. In the second configuration (Ab), the first two variables of each of the first ten
blocks are active. Thus, in (Aa), half of the active variables appear in the same block
while in the other case (Ab), the active variables are distributed among ten blocks. The
remaining two configurations (Ac) and (Ad) are modifications in terms of random sign
changes. The models (Ab) and (Ad) come closest to the model (20)-(21) considered for
theoretical purposes: the difference is that the former models have two active variables
per active block (or group) while the latter model has only one active variable per active
group.

Table 2 and Figure 3 describe the simulation results. From Table 2 we see that over all
the configurations, the CGL method has lower predictive performance than the other
two methods. Comparing the two methods Lasso and CRL, we cannot distinguish a
clear difference with respect to prediction. We also find that sign switches of half of the
active variables (Ac, Ad) do not have a negative effect on the predictive performance
of the CRL method (which in principle could suffer severely from sign switches). The
CGL method even gains in predictive performance in (Ac) and (Ad) compared to the
no sign-switch configurations (Aa) and (Ab).

$\sigma$    Method   (Aa)             (Ab)             (Ac)             (Ad)
3     CRL      [illegible]      [illegible]      [illegible]      [illegible]
3     CGL      14.97 (2.40)     37.05 (5.21)     13.34 (2.06)     24.31 (6.50)
3     Lasso    11.94 (1.97)     16.23 (2.47)     12.72 (1.67)     15.34 (2.53)
12    CRL      161.73 (25.74)   177.90 (25.87)   157.86 (20.63)   165.30 (23.56)
12    CGL      206.19 (29.97)   186.61 (25.69)   160.31 (23.04)   168.26 (24.70)
12    Lasso    168.53 (25.88)   179.47 (25.77)   158.02 (20.31)   166.50 (23.74)

Table 2: MSE for the block diagonal model with standard deviations in brackets.

Figure 3: Plot of $|\hat{S} \cap S_0| / |S_0|$ versus $|\hat{S}|$ for block diagonal model. Cluster
representative Lasso (CRL, black solid line), cluster group Lasso (CGL, blue dashed-dotted
line), and Lasso (red dashed line).



From Figure 3 we infer that for the block diagonal simulation model the two methods
CRL and CGL outperform the Lasso concerning variable screening. Taking a closer
look, the cluster Lasso methods CRL and CGL benefit more when having a lot of
active variables in a cluster as in settings (Aa) and (Ac).

5.1.2 Single block model

We simulate the covariables $X \sim \mathcal{N}_p(0, \Sigma_B)$ where

$$\Sigma_{B;i,j} = \begin{cases} 1, & i = j, \\ 0.9, & i, j \in \{1, \ldots, 30\} \text{ and } i \neq j, \\ 0, & \text{else.} \end{cases}$$

Such a $\Sigma_B$ corresponds to a single group of strongly correlated variables of size 30. The
rest of the 970 variables are uncorrelated. For the regression parameter $\beta^0$ we consider
the following configurations:

(Ba) $S_0 = \{1, 2, \ldots, 15\} \cup \{31, 32, \ldots, 35\}$ and for any $j \in S_0$ we sample $\beta_j^0$
from the set $\{2/s_0, 4/s_0, \ldots, 2\}$ without replacement (anew in each simulation run).

(Bb) $S_0 = \{1, 2, \ldots, 5\} \cup \{31, 32, \ldots, 45\}$ and for any $j \in S_0$ we sample $\beta_j^0$
from the set $\{2/s_0, 4/s_0, \ldots, 2\}$ without replacement (anew in each simulation run).

(Bc) $\beta^0$ as in (Ba) but we switch the sign of half and randomly chosen active parameters
(anew in each simulation run).

(Bd) $\beta^0$ as in (Bb) but we switch the sign of half and randomly chosen active parameters
(anew in each simulation run).

In the first set-up (Ba), a major fraction of the active variables are in the same block of
highly correlated variables. In the second scenario (Bb), most of the active variables are
distributed among the independent variables. The remaining two configurations (Bc)
and (Bd) are modifications in terms of random sign changes. The results are described
in Table 3 and Figure 4.



$\sigma$    Method   (Ba)             (Bb)             (Bc)             (Bd)
3     CRL      16.73 (2.55)     27.91 (4.80)     15.49 (2.93)     22.17 (4.47)
3     CGL      247.52 (28.74)   54.73 (10.59)    21.37 (9.51)     31.58 (14.17)
3     Lasso    17.13 (3.01)     27.18 (4.51)     15.02 (2.74)     21.91 (4.48)
12    CRL      173.89 (23.69)   181.62 (24.24)   161.01 (23.19)   175.49 (23.61)
12    CGL      384.78 (48.26)   191.26 (25.55)   159.40 (23.88)   174.49 (25.40)
12    Lasso    173.37 (23.23)   178.86 (23.80)   160.55 (22.80)   174.14 (23.14)

Table 3: MSE for the single block model with standard deviations in brackets.



In Table 3 we see that over all the configurations the CRL method performs as well
as the Lasso, and both of them outperform the CGL. We again find that the CGL
method gains in predictive performance when the signs of the coefficient vector are not
the same everywhere, and this benefit is more pronounced when compared to the
block diagonal model.

The plots in Figure 4 for variable screening show that the CRL method performs better
than the Lasso for all of the configurations. The CGL method is clearly inferior to the
Lasso especially in the setting (Ba) where the CGL seems to have severe problems in
finding the true active variables.

Figure 4: Plot of $|\hat{S} \cap S_0| / |S_0|$ versus $|\hat{S}|$ for single block model. Cluster
representative Lasso (CRL, black solid line), cluster group Lasso (CGL, blue dashed-dotted
line), and Lasso (red dashed line).

5.1.3 Duo block model 



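We simulate the covariables $X$ according to $\mathcal{N}_p(0, \Sigma_C)$ where $\Sigma_C$ is a block
diagonal matrix. We use the $2 \times 2$ block matrices

$$\Gamma = \begin{pmatrix} 1 & 0.9 \\ 0.9 & 1 \end{pmatrix}$$

and the block diagonal of $\Sigma_C$ consists now of 500 such block matrices $\Gamma$. In this
setting we only look at one set-up for the parameter $\beta^0$:

(C) $S_0 = \{1, \ldots, 20\}$ with

$$\beta_j^0 = \begin{cases} 2, & j \in \{1, 3, 5, 7, 9, 11, 13, 15, 17, 19\}, \\ \frac{1}{3} \sqrt{\frac{\log p}{n}} \, \frac{\sigma}{1.9}, & j \in \{2, 4, 6, 8, 10, 12, 14, 16, 18, 20\}. \end{cases}$$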


The idea of choosing the parameters in this way is given by the fact that the Lasso
would typically not select the variables from $\{2, 4, 6, \ldots, 20\}$ but would select the others
from $\{1, 3, 5, \ldots, 19\}$. The following Table 4 and Figure 5 show the simulation results
for the duo block model.

$\sigma$    Method   (C)
3     CRL      [illegible]
3     CGL      32.00 (6.50)
3     Lasso    22.45 (4.64)
12    CRL      190.93 (25.45)
12    CGL      193.97 (27.05)
12    Lasso    190.91 (25.64)

Table 4: MSE for duo block model with standard deviations in brackets.

Figure 5: Plot of $|\hat{S} \cap S_0| / |S_0|$ versus $|\hat{S}|$ for duo block model. Cluster
representative Lasso (CRL, black solid line), cluster group Lasso (CGL, blue dashed-dotted
line), and Lasso (red dashed line).



From Table 4 we infer that for the duo block model, all three estimation methods have a
similar prediction performance. Especially for $\sigma = 12$ we see no difference between the
methods. But in terms of variable screening, we see in Figure 5 that the two techniques
CRL and CGL are clearly better than the Lasso.



5.2 Pseudo-real data 



For the pseudo real data example described below, we also consider the CRL method 



with ordinary hierarchical clustering as detailed in Section 2.2 We denote the method 
by CRLcor. 

We consider here an example with real data design matrix X but synthetic regression 
coefficients /3° and simulated Gaussian errors J\f n (0,a 2 I) in a linear model as in (jlj. 
For the real data design matrix X we consider a data set about riboflavin (vitamin B2) 
production by bacillus subtilis. That data has been provided by DSM (Switzerland). 
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The covariates are measurements of the logarithmic expression level of 4088 genes (and 
the response variable Y is the logarithm of the riboflavin production rate, but we do not 
use it here). The data consists of n = 71 samples of genetically engineered mutants of 
bacillus subtilis. There are different strains of bacillus subtilis which are cultured under 
different fermentation conditions, which makes the population rather heterogeneous. 

We reduce the dimension to p = 1000 covariates which have largest empirical variances 
and choose the size of the active set as so = 10. 

(D1) $S_0$ is chosen as a randomly selected variable $k$ and the nine covariates which have
highest absolute correlation to variable $k$ (anew in each simulation run). For each
$j \in S_0$ we use $\beta_j^0 = 1$.

(D2) $S_0$ is chosen as one randomly selected entry in each of the five biggest clusters
of both clustering methods (using either Algorithm 1 or hierarchical clustering as
described in Section 2.2) resulting in $s_0 = 10$ (anew in each simulation run). For
each $j \in S_0$ we use $\beta_j^0 = 1$.

The results are given in Table 5 and Figure 6 based on 50 independent simulation runs.
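The data-generating step for design (D1) can be sketched in a few lines. This is only an illustration under stated assumptions: the riboflavin design matrix is not reproduced here, so a random matrix serves as a stand-in, and the helper names `keep_largest_variance` and `simulate_d1` are hypothetical rather than taken from the original study.

```python
import numpy as np

def keep_largest_variance(x, p_keep):
    """Keep the p_keep columns of x with the largest empirical variances."""
    top = np.argsort(-x.var(axis=0))[:p_keep]
    return x[:, np.sort(top)]

def simulate_d1(x, s0=10, sigma=3.0, rng=None):
    """Draw (y, beta0, S0) following design (D1) for a fixed design matrix x."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = x.shape
    k = int(rng.integers(p))                       # randomly selected variable k
    abs_corr = np.abs(np.corrcoef(x, rowvar=False)[k])
    abs_corr[k] = np.inf                           # rank k itself first
    active = np.sort(np.argsort(-abs_corr)[:s0])   # k and its s0 - 1 closest covariates
    beta0 = np.zeros(p)
    beta0[active] = 1.0                            # beta_j^0 = 1 on the active set
    y = x @ beta0 + sigma * rng.standard_normal(n)
    return y, beta0, active

# Assumption: random stand-in for the real design (n = 71 as in the text, smaller p).
rng = np.random.default_rng(0)
x_raw = rng.standard_normal((71, 200)) * rng.uniform(0.5, 2.0, size=200)
x_demo = keep_largest_variance(x_raw, p_keep=100)
y_demo, beta0_demo, s0_demo = simulate_d1(x_demo, s0=10, sigma=3.0, rng=rng)
```

Calling `simulate_d1` once per simulation run redraws both the variable $k$ and the noise, which matches the "anew in each simulation run" description.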



$\sigma$   Method    (D1)             (D2)

3          CRL       2.47 (0.94)      2.99 (0.72)
3          CGL       2.36 (0.93)      3.13 (0.74)
3          Lasso     2.47 (0.94)      2.96 (0.60)
3          CRLcor    39.02 (25.15)    7.08 (2.76)

12         CRL       19.62 (10.11)    14.80 (4.91)
12         CGL       17.49 (9.28)     14.90 (5.44)
12         Lasso     19.63 (10.00)    15.66 (4.84)
12         CRLcor    50.40 (27.68)    15.46 (5.74)

Table 5: Prediction error for the pseudo real riboflavin data with standard deviations 
in brackets. 

Table 5 shows that we do not really gain any predictive power when using the proposed
cluster Lasso methods CRL or CGL: this finding is consistent with the reported results
for simulated data in Section 5.1. The method CRLcor, using standard hierarchical
clustering based on correlations (see Section 2.2), performs very poorly: the reason is
that the automatically chosen number of clusters results in a partition with one very
large cluster, and the representative mean value of such a very large cluster seems to be
inappropriate. Using the group Lasso for such a partition (i.e., clustering) is ill-posed
as well since the group size of such a large cluster is larger than sample size $n$.

Figure 6 shows a somewhat different picture for variable screening. For the setting
(D1), all methods except CRLcor perform similarly, but for (D2), the two cluster Lasso
methods CRL and CGL perform better than plain Lasso. Especially for the low noise
level $\sigma = 3$ case, we see a substantial performance gain of the CRL and CGL compared
to the Lasso. Nevertheless, the improvement over plain Lasso is less pronounced than
for quite a few of the simulated models in Section 5.1. The CRLcor method is again
performing very poorly: the reason is the same as mentioned above for prediction while
in addition, if the large cluster is selected, it results in a large contribution of the
cardinality $|\hat{S}|$.

Figure 6: Plots of $|\hat{S} \cap S_0|/|S_0|$ versus $|\hat{S}|$ for the pseudo real riboflavin data. Cluster
representative Lasso (CRL, black solid line), cluster group Lasso (CGL, blue dashed-dotted
line), Lasso (red dashed line) and CRLcor (magenta dashed-dotted line).
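The screening curves compare the fraction of recovered active variables with the number of selected variables. A minimal sketch of how such a curve could be computed is given below. It assumes that selecting a cluster counts all its members as selected, which is consistent with the remark about the large cluster inflating the cardinality. The function names are illustrative, not the authors' code.

```python
import numpy as np

def expand_clusters(selected_clusters, clusters, p):
    """Mark every variable of each selected cluster as selected."""
    mask = np.zeros(p, dtype=bool)
    for r in selected_clusters:
        mask[clusters[r]] = True
    return mask

def screening_curve(selected_path, active):
    """Return |S_hat| and |S_hat ∩ S_0| / |S_0| for each row of a boolean path."""
    selected_path = np.asarray(selected_path, dtype=bool)
    active = np.asarray(active)
    size = selected_path.sum(axis=1)
    tpr = selected_path[:, active].sum(axis=1) / active.size
    return size, tpr

# Toy path: clusters enter one after another along a regularization path.
clusters = [[0, 1, 2], [3], [4, 5]]
active = [0, 4]
path = [expand_clusters(sel, clusters, p=6) for sel in ([], [0], [0, 1], [0, 1, 2])]
size, tpr = screening_curve(path, active)
```

For the plain Lasso the same `screening_curve` applies directly to the nonzero pattern of the coefficient path, with no cluster expansion.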



5.3 Summarizing the empirical results 

We clearly see that in the pseudo real data example and most of the simulation settings,
the cluster Lasso techniques (CRL and CGL) outperform the Lasso in terms of
variable screening; the gain is less pronounced for the pseudo real data example.
Considering prediction, the CRL and the Lasso display similar performance while the CGL
is not keeping up with them. Such a deficit of the CGL method seems to arise in
cases where we have many non-active variables in an active group, leading to an
efficiency loss: it might be repaired by using a sparse group Lasso (Friedman et al.,
2010a). The difference between the clustering methods, Algorithm 1 and standard
hierarchical clustering based on correlations (see Section 2.2), is essentially nonexistent for
the simulation models in Section 5.1 while for the pseudo real data example in Section
5.2 the disagreement is huge and our novel Algorithm 1 leads to much better results.



6 Conclusions 

We consider estimation in a high-dimensional linear model with strongly correlated 
variables. In such a setting, single variables cannot (or are at least very difficult to) 
be identified. We propose to group or cluster the variables first and do subsequent 
estimation with the Lasso for cluster-representatives (CRL: cluster representative Lasso) 
or with the group Lasso using the structure of the inferred clusters (CGL: Cluster group 
Lasso). Regarding the first step, we present a new bottom-up agglomerative clustering 
algorithm which aims for small canonical correlations between groups: we prove that it 
finds an optimal solution, that it is statistically consistent, and we give a simple rule for 
selecting the number of clusters. This new algorithm is motivated by the natural idea 
to address the problem of almost linear dependence between variables, but if preferred, 
it can be replaced by another suitable clustering procedure. 
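The two-step idea behind the cluster representative Lasso can be sketched as follows: average the variables within each cluster, then run a Lasso on the averages. This is a simplified illustration, not the authors' implementation. The clusters are taken as given, the Lasso is solved with a plain coordinate descent written for this example, and the penalty level `lam` is fixed by hand rather than tuned.

```python
import numpy as np

def cluster_representatives(x, clusters):
    """Column r is the mean of the variables in cluster r."""
    return np.column_stack([x[:, g].mean(axis=1) for g in clusters])

def lasso_cd(z, y, lam, n_iter=200):
    """Minimise ||y - z b||_2^2 / (2n) + lam * ||b||_1 by coordinate descent."""
    n, q = z.shape
    b = np.zeros(q)
    col_sq = (z ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(q):
            if col_sq[j] == 0.0:
                continue
            resid += z[:, j] * b[j]                 # remove coordinate j from the fit
            rho = z[:, j] @ resid / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]  # soft threshold
            resid -= z[:, j] * b[j]
    return b

# Toy data: three tight clusters, only the first one is active.
rng = np.random.default_rng(1)
n = 200
latent = rng.standard_normal((n, 3))
clusters = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
x = np.repeat(latent, 3, axis=1) + 0.1 * rng.standard_normal((n, 9))
x -= x.mean(axis=0)
y = x[:, :3].sum(axis=1) + 0.5 * rng.standard_normal(n)
z = cluster_representatives(x, clusters)
gamma_hat = lasso_cd(z, y - y.mean(), lam=0.5)
# A selected representative selects all variables of its cluster.
selected = sorted(j for r, g in enumerate(clusters) if gamma_hat[r] != 0 for j in g)
```

Because the three active variables are nearly collinear, the plain Lasso would typically pick only some of them, whereas the representative carries the whole cluster into the selected set.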

We present some theory which: (i) shows that canonical correlation based clustering 
leads to a (much) improved compatibility constant for the cluster group Lasso; and 
(ii) addresses bias and detection issues when doing subsequent estimation on cluster 
representatives, e.g. as with the CRL method. Regarding the second issue (ii), one 
favorable scenario is for (nearly) uncorrelated clusters with potentially many active 
variables in a cluster: the bias due to working with cluster representatives is small if 
the within group correlation is high, and detection is good if the regression coefficients 
within a group do not cancel. The other beneficial setting is for clusters with at most 
one active variable per cluster but the between cluster correlation does not need to be 
very small: if the cluster size is large or the correlation within the clusters is large, the 
bias due to cluster representatives is small and detection works well. We note that large 
cluster sizes cannot be properly handled by the cluster group Lasso while they can be 
advantageous for the cluster representative Lasso; instead of the group Lasso, one should
take for such cases a sparse group Lasso (Friedman et al., 2010a) or a smoothed group
Lasso (Meier et al., 2009). Our theoretical analysis sheds light on when and why estimation
with cluster representatives works well and leads to improvements, in comparison to the
plain Lasso.

We complement the theoretical analysis with various empirical results which confirm 
that the cluster Lasso methods (CRL and CGL) are particularly attractive for improved 
variable screening in comparison to the plain Lasso. In view of the fact that variable 
screening and dimension reduction (in terms of the original variables) is one of the main 
applications of Lasso in high-dimensional data analysis, the cluster Lasso methods are 
an attractive and often better alternative for this task.



7 Proofs 

7.1 Proof of Theorem 2.1

We first show an obvious result. 

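Lemma 7.1 Consider a partition $\mathcal{G} = \{G_1, \ldots, G_q\}$ which satisfies $\tau$-separation, i.e.,
$\hat{\rho}_{\mathrm{can}}(G_r, G_\ell) \le \tau$ for all $r \neq \ell$. Then, for every $r, \ell \in \{1, \ldots, q\}$ with $r \neq \ell$:

$$\hat{\rho}_{\mathrm{can}}(J_1, J_2) \le \tau \quad \text{for all subsets } J_1 \subseteq G_r,\ J_2 \subseteq G_\ell.$$

The proof follows immediately from the inequality

$$\hat{\rho}_{\mathrm{can}}(J_1, J_2) \le \hat{\rho}_{\mathrm{can}}(G_r, G_\ell).$$

□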



For proving Theorem 2.1, the fact that we obtain a solution with $\tau$-separation is a
straightforward consequence of the definition of the algorithm, which continues to merge groups
until all canonical correlations between groups are less than or equal to $\tau$.

We now prove that the obtained partition is the finest clustering with $\tau$-separation. Let
$\mathcal{G}(\tau) = \{G_1, \ldots, G_q\}$ be an arbitrary clustering with $\tau$-separation and $\mathcal{G}_b$, $b = 1, \ldots, b^*$,
be the sequence of partitions generated by the algorithm (where $b^*$ is the stopping
(first) iteration where $\tau$-separation is reached). We need to show that $\mathcal{G}_{b^*}$ is a finer
partition of $\{1, \ldots, p\}$ than $\mathcal{G}(\tau)$. Here, the meaning of "finer than" is not strict, i.e.,
including "equal to". To this end, it suffices to prove by induction that $\mathcal{G}_b$ is finer than
$\mathcal{G}(\tau)$ for $b = 1, \ldots, b^*$. This is certainly true for $b = 1$ since the algorithm begins with
the finest partition of $\{1, \ldots, p\}$. Now assume the induction condition that $\mathcal{G}_b$ is finer
than $\mathcal{G}(\tau)$ for $b < b^*$. The algorithm computes $\mathcal{G}_{b+1}$ by merging two members, say
$G'$ and $G''$, of $\mathcal{G}_b$ such that $\hat{\rho}_{\mathrm{can}}(G', G'') > \tau$. Since $\mathcal{G}_b$ is finer than $\mathcal{G}(\tau)$, there must
exist members $G_{j_1}$ and $G_{j_2}$ of $\mathcal{G}(\tau)$ such that $G' \subseteq G_{j_1}$ and $G'' \subseteq G_{j_2}$. This implies
$\hat{\rho}_{\mathrm{can}}(G_{j_1}, G_{j_2}) \ge \hat{\rho}_{\mathrm{can}}(G', G'') > \tau$, see Lemma 7.1. Since $\hat{\rho}_{\mathrm{can}}(G_j, G_k) \le \tau$ for all $j \neq k$,
we must have $j_1 = j_2$. Thus, the algorithm merges two subsets of a common member
(namely $G_{j_1} = G_{j_2}$) of $\mathcal{G}(\tau)$. It follows that $\mathcal{G}_{b+1}$ is still finer than $\mathcal{G}(\tau)$. □
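The merging scheme analysed in this proof can be sketched as follows. The sketch rests on assumptions that go beyond what this section states: the pair with the largest sample canonical correlation is merged at each step, and the canonical correlation is computed from orthonormal bases of the centred column spaces. The function names are hypothetical and the code is a brute-force illustration rather than the authors' Algorithm 1.

```python
import numpy as np

def _basis(m):
    """Orthonormal basis of the column space of the centred block m."""
    u, s, _ = np.linalg.svd(m - m.mean(axis=0), full_matrices=False)
    if s[0] == 0.0:
        return u[:, :0]
    return u[:, s > 1e-10 * s[0]]

def canonical_correlation(xa, xb):
    """Largest sample canonical correlation between two column blocks."""
    qa, qb = _basis(xa), _basis(xb)
    if qa.shape[1] == 0 or qb.shape[1] == 0:
        return 0.0
    return float(min(1.0, np.linalg.svd(qa.T @ qb, compute_uv=False)[0]))

def cluster_by_canonical_correlation(x, tau):
    """Merge clusters bottom-up until all pairwise canonical correlations are <= tau."""
    clusters = [[j] for j in range(x.shape[1])]     # start from the finest partition
    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                rho = canonical_correlation(x[:, clusters[a]], x[:, clusters[b]])
                if rho > best:
                    best, pair = rho, (a, b)
        if best <= tau:                             # tau-separation reached
            break
        a, b = pair
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters

# Toy example: two independent latent factors, each observed twice with small noise.
rng = np.random.default_rng(0)
z1, z2 = rng.standard_normal((2, 200))
noise = 0.1 * rng.standard_normal((200, 4))
x_toy = np.column_stack([z1, z1, z2, z2]) + noise
clusters_toy = cluster_by_canonical_correlation(x_toy, tau=0.5)
```

Every merge joins two groups whose canonical correlation exceeds `tau`, which is exactly the property the induction step relies on. Note that the sample canonical correlation equals one whenever the combined group size reaches the sample size, so the sketch is only meaningful for groups that are small relative to $n$.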

7.2 Proof of Theorem 2.2



For ease of notation, we abbreviate a group index $G_r$ by $r$. The proof of Theorem 2.2
is based on the following bound for the maximum difference between the sample and
population correlations of linear combinations of variables. Define

$$\Delta_{r,\ell} = \max_{u,v} \left| \hat{\rho}(X^{(r)}u, X^{(\ell)}v) - \rho(u^T X^{(r)}, v^T X^{(\ell)}) \right|.$$

Lemma 7.2 Consider $X$ as in (J) and $\mathcal{G}^0 = \{G_1, \ldots, G_q\}$ a partition of $\{1, \ldots, p\}$.
Let $\Sigma_{r,\ell} = \mathrm{Cov}(X^{(r)}, X^{(\ell)})$, $t > 0$ and $d_r = \mathrm{rank}(\Sigma_{r,r})$. Define $\Delta^*_{r,\ell}$ by 0). Then,

$$P\left[ \max_{1 \le r \le \ell \le q} \left( \Delta_{r,\ell} - \Delta^*_{r,\ell} \right) > 0 \right] \le \exp(-t).$$



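Proof of Lemma 7.2. Taking a new coordinate system if necessary, we may assume
without loss of generality that $\Sigma_{r,r} = I_{d_r}$ and $\Sigma_{\ell,\ell} = I_{d_\ell}$. Let $\hat{\Sigma}_{r,\ell}$ be the sample versions
of $\Sigma_{r,\ell}$. For $r \neq \ell$, we write a linear regression model $X^{(\ell)} = X^{(r)}\Sigma_{r,\ell} + \varepsilon^{(r,\ell)} V_{r,\ell}^{1/2}$ such
that $\varepsilon^{(r,\ell)}$ is an $n \times d_\ell$ matrix of i.i.d. $N(0,1)$ entries independent of $X^{(r)}$ and
$\|V_{r,\ell}^{1/2}\|_{(S)} \le 1$, where $\|\cdot\|_{(S)}$ is the spectrum norm. This gives
$\hat{\Sigma}_{r,\ell} = \hat{\Sigma}_{r,r}\Sigma_{r,\ell} + (X^{(r)})^T \varepsilon^{(r,\ell)} V_{r,\ell}^{1/2}/n$.
Let $U_r U_r^T$ be the SVD of the projection $X^{(r)} \hat{\Sigma}_{r,r}^{-1} (X^{(r)})^T/n$, with $U_r \in \mathbb{R}^{n \times d_r}$. Since
$X^{(r)}$ is independent of $\varepsilon^{(r,\ell)}$, $U_r^T \varepsilon^{(r,\ell)}$ is a $d_r \times d_\ell$ matrix of i.i.d. $N(0,1)$ entries. Let

$$\Omega_r = \left\{ \|\hat{\Sigma}_{r,r}^{1/2} - I_{d_r}\|_{(S)} \le t_r \right\}, \qquad \Omega_{r,\ell} = \left\{ \|U_r^T \varepsilon^{(r,\ell)}/\sqrt{n}\|_{(S)} \le t_r \wedge t_\ell \right\}.$$

It follows from Q and Theorem II.13 of Davidson and Szarek (2001) that

$$\max\left\{ P[\Omega_r^c], P[\Omega_{r,\ell}^c] \right\} \le 2 P\left[ N(0,1) > \sqrt{2(t + \log(q(q+1)))} \right] \le 2 e^{-t}/\{q(q+1)\}.$$

In the event $\Omega_r$, we have

$$\begin{aligned}
\Delta_{r,r} &= \max_{\|u\|_2 = \|v\|_2 = 1} \left| \frac{u^T \hat{\Sigma}_{r,r} v}{\|\hat{\Sigma}_{r,r}^{1/2} u\|_2 \|\hat{\Sigma}_{r,r}^{1/2} v\|_2} - u^T v \right| \\
&\le \|\hat{\Sigma}_{r,r} - I_{d_r}\|_{(S)}/(1 - t_r)^2 + \max_{\|u\|_2 = \|v\|_2 = 1} \left| \frac{u^T v}{\|\hat{\Sigma}_{r,r}^{1/2} u\|_2 \|\hat{\Sigma}_{r,r}^{1/2} v\|_2} - u^T v \right| \\
&\le \{(1 + t_r)^2 - 1\}/(1 - t_r)^2 + 1/(1 - t_r)^2 - 1 \\
&= \Delta^*_{r,r}.
\end{aligned}$$

For $r \neq \ell$, the variable change $u \to \hat{\Sigma}_{r,r}^{-1/2} u$ gives

$$\Delta_{r,\ell} = \max_{\|u\|_2 = \|v\|_2 = 1} \left| \frac{u^T \hat{\Sigma}_{r,r}^{-1/2} \hat{\Sigma}_{r,\ell} v}{\|\hat{\Sigma}_{\ell,\ell}^{1/2} v\|_2} - \frac{u^T \hat{\Sigma}_{r,r}^{-1/2} \Sigma_{r,\ell} v}{\|\hat{\Sigma}_{r,r}^{-1/2} u\|_2} \right|.$$

In the event $\Omega_r \cap \Omega_\ell \cap \Omega_{r,\ell}$, we have

$$\begin{aligned}
\Delta_{r,\ell} &\le \max_{\|u\|_2 = \|v\|_2 = 1} \left\{ \frac{|u^T \hat{\Sigma}_{r,r}^{-1/2} (\hat{\Sigma}_{r,\ell} - \Sigma_{r,\ell}) v|}{\|\hat{\Sigma}_{\ell,\ell}^{1/2} v\|_2} + \left| u^T \hat{\Sigma}_{r,r}^{-1/2} \Sigma_{r,\ell} v \left( \frac{1}{\|\hat{\Sigma}_{\ell,\ell}^{1/2} v\|_2} - \frac{1}{\|\hat{\Sigma}_{r,r}^{-1/2} u\|_2} \right) \right| \right\} \\
&\le \max_{\|u\|_2 = \|v\|_2 = 1} \left\{ \frac{\|\hat{\Sigma}_{r,r}^{-1/2} (\hat{\Sigma}_{r,\ell} - \Sigma_{r,\ell})\|_{(S)}}{\|\hat{\Sigma}_{\ell,\ell}^{1/2} v\|_2} + \left| \frac{\|\hat{\Sigma}_{r,r}^{-1/2} u\|_2}{\|\hat{\Sigma}_{\ell,\ell}^{1/2} v\|_2} - 1 \right| \right\} \\
&\le \left\| (\hat{\Sigma}_{r,r}^{1/2} - \hat{\Sigma}_{r,r}^{-1/2}) \Sigma_{r,\ell} + \hat{\Sigma}_{r,r}^{-1/2} (X^{(r)})^T \varepsilon^{(r,\ell)} V_{r,\ell}^{1/2}/n \right\|_{(S)}/(1 - t_\ell) + 1/\{(1 - t_r)(1 - t_\ell)\} - 1 \\
&\le \left( \|\hat{\Sigma}_{r,r}^{1/2} - \hat{\Sigma}_{r,r}^{-1/2}\|_{(S)} + \|U_r^T \varepsilon^{(r,\ell)}/\sqrt{n}\|_{(S)} \right)/(1 - t_\ell) + (t_r + t_\ell)/\{(1 - t_r)(1 - t_\ell)\} \\
&\le \{1/(1 - t_r) - (1 - t_r) + t_r \wedge t_\ell\}/(1 - t_\ell) + (t_r + t_\ell)/\{(1 - t_r)(1 - t_\ell)\} \\
&= \left( 1 - (1 - t_r)^2 + (1 - t_r)(t_r \wedge t_\ell) + t_r + t_\ell \right)/\{(1 - t_r)(1 - t_\ell)\}.
\end{aligned}$$

Thus, for $t_r \le t_\ell$, we find $\Delta_{r,\ell} \le \Delta^*_{r,\ell}$. Since $\Delta_{r,\ell} = \Delta_{\ell,r}$, the above bounds hold
simultaneously in the intersection of all $\Omega_r$ and those $\Omega_{r,\ell}$ with either $t_r < t_\ell$ or $r < \ell$
for $t_r = t_\ell$. Since there are in total $q + q(q-1)/2 = q(q+1)/2$ such events,
$P[\Delta_{r,\ell} \le \Delta^*_{r,\ell}\ \forall r, \ell] \ge 1 - \exp(-t)$. □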



Proof of Theorem 2.2. It follows from (5) that $\mathcal{G}^0$ is the finest population clustering
with $\tau$-separation for all $\tau_- \le \tau \le \tau_+$. Since $\rho_{\mathrm{can}}(G_r, G_\ell) = \max_{u,v} \rho(u^T X^{(r)}, v^T X^{(\ell)})$
and $\hat{\rho}_{\mathrm{can}}(G_r, G_\ell) = \max_{u,v} \hat{\rho}(X^{(r)}u, X^{(\ell)}v)$, (5) and Lemma 7.2 imply that with
probability at least $1 - e^{-t}$, the inequalities

$$\hat{\rho}_{\mathrm{can}}(G_r, G_\ell) \le \rho_{\mathrm{can}}(G_r, G_\ell) + \Delta^*_{r,\ell} \le \tau_-,$$

$$\max_{k_1 < k_2} \hat{\rho}_{\mathrm{can}}(G_{r;k_1}, G_{r;k_2}) \ge \max_{k_1 < k_2} \rho_{\mathrm{can}}(G_{r;k_1}, G_{r;k_2}) - \Delta^*_{r,r} \ge \tau_+,$$

hold simultaneously for all $r < \ell$ and nontrivial partitions $\{G_{r;k}, k \le q_r\}$ of $G_r$, $r =
1, \ldots, q$. In this case, $\mathcal{G}^0$ is also the finest sample clustering with $\tau$-separation for all
$\tau_- \le \tau \le \tau_+$. The conclusion follows from Theorem 2.1. □
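The proof hinges on the sample canonical correlation staying close to its population counterpart. A small simulation can make this tangible. It is only an illustration with an arbitrarily chosen covariance matrix, and it compares the two maximal correlations, whose gap is at most $\Delta_{r,\ell}$. It does not evaluate the bound $\Delta^*_{r,\ell}$ itself.

```python
import numpy as np

def inv_sqrt(a):
    """Inverse symmetric square root of a positive definite matrix."""
    w, v = np.linalg.eigh(a)
    return (v / np.sqrt(w)) @ v.T

def can_corr(cov, ga, gb):
    """Largest canonical correlation between index sets ga and gb under cov."""
    m = inv_sqrt(cov[np.ix_(ga, ga)]) @ cov[np.ix_(ga, gb)] @ inv_sqrt(cov[np.ix_(gb, gb)])
    return float(np.linalg.svd(m, compute_uv=False)[0])

# Assumed toy covariance: correlation 0.8 within groups, 0.3 between groups.
sigma = np.array([[1.0, 0.8, 0.3, 0.3],
                  [0.8, 1.0, 0.3, 0.3],
                  [0.3, 0.3, 1.0, 0.8],
                  [0.3, 0.3, 0.8, 1.0]])
ga, gb = [0, 1], [2, 3]
rho_pop = can_corr(sigma, ga, gb)        # population value, equals 1/3 here

rng = np.random.default_rng(0)
gaps = {}
for n in (50, 500, 20000):
    x = rng.multivariate_normal(np.zeros(4), sigma, size=n)
    rho_hat = can_corr(np.cov(x, rowvar=False), ga, gb)   # sample version
    gaps[n] = abs(rho_hat - rho_pop)
```

The entries of `gaps` give the absolute deviation for each sample size. For Gaussian data the deviation is expected to shrink as $n$ grows relative to the group ranks.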



7.3 Proof of Theorem HID 

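We write for notational simplicity $S_0 := S_{0,\mathrm{Group}}$ and $s_0 := |S_{0,\mathrm{Group}}|$. Note first that
for all $r$,

$$\|X^{(G_r)} \beta_{G_r}\|_2^2 = n \beta_{G_r}^T \hat{\Sigma}_{r,r} \beta_{G_r} = n \|\gamma_{G_r}\|_2^2,$$

where $\gamma_{G_r} := \hat{\Sigma}_{r,r}^{1/2} \beta_{G_r}$. Moreover,

$$\|X^{(S_0)} \beta_{S_0}\|_2^2 = n \gamma_{S_0}^T \hat{R}_{S_0} \gamma_{S_0}.$$

It follows that

$$\sum_{r \in S_0} \|X^{(G_r)} \beta_{G_r}\|_2^2 = n \|\gamma_{S_0}\|_2^2 \le n \gamma_{S_0}^T \hat{R}_{S_0} \gamma_{S_0}/\hat{\Lambda}^2_{\min} = \|X^{(S_0)} \beta_{S_0}\|_2^2/\hat{\Lambda}^2_{\min}. \qquad (30)$$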



Furthermore, 



n 



S ^G r ^r~r 1/2 ^% 1/2 7G , <™J^ ^ Pcan(G r , G/) ||7 Gr || 2 hd h 

reS teS$ 

Y, E Pcan(G r ,G,)||X( G ")/3 Gr || 2 ||X^)/3 G£ || 2 

r£S fcSg 
^ ^ Pcan(G r , G*£ 



|X^)/3 Gr 




<p||x( 5o )/35 ||2,i||X (5 o c) /35sll2,i- 
Hence, for all $\beta$ satisfying $\|\beta_{S_0^c}\|_{2,1} \le 3\|\beta_{S_0}\|_{2,1}$, we have



" m 



(31) 



Applying the Cauchy-Schwarz inequality and (30) gives



'Sol 



< 



m 



52\\X«UpGr\\l< 
reSo 



Er-eso" 1 - \\X {So) Ps \\ 2 2 _ s fh So ||X( S °>/3 S J! 



in 



A 2 . 

mm 



in 



A 2 . 

mm 



Insert this in (31) to get 



Use the assumption that 



mA min 



and apply Lemma 6.26 in Bühlmann and van de Geer (2011) to conclude that



iX^sJ?, < (1 



3s m So 

™ A min 



mil 



Hence, 



2 s m So ||X( s o)^ 5o | 
2,1 ^ 



A 2 . 

mm 



< 



1 ^ 3s m5 ^ 2 ^ ^prngp 



771 



A 2 



V m / A 2 



IX/ 



This leads to the first lower bound in the statement of the theorem. The second lower
bound follows immediately by the incoherence assumption for $\rho$. Furthermore, it is not
difficult to see that $\hat{\Lambda}^2_{\min} \ge (1 - |S_{0,\mathrm{Group}}| \rho_{S_{0,\mathrm{Group}}})$, and using the incoherence assumption
for $\rho_{S_{0,\mathrm{Group}}}$ leads to strict positivity. □
7.4 Proof of Proposition 4.2

We can invoke the analysis given in Bühlmann and van de Geer (2011, Th. 6.1). The
slight deviations involve: (i) use that $\mathbb{E}[\eta_i | Z] = 0$; (ii) due to the Gaussian assumption,
$\mathrm{Var}(\eta_i | Z)$ is constant equaling

$$\mathrm{Var}(\eta_i | Z) = \mathbb{E}[\mathrm{Var}(\eta_i | Z)] = \mathrm{Var}(\eta_i) = \sigma^2 + \mathbb{E}[(\mu_X - \mu_Z)^2];$$

and (iii) the probability bound in Bühlmann and van de Geer (2011, Lem. 6.2)
can be easily obtained for non-standardized variables when multiplying $\lambda_0$ with $\|\hat{\sigma}_Z\|_\infty$.
The issues (ii) and (iii) explain the factors appearing in the definition of $\lambda_0$. □



7.5 Proof of Proposition 4.3 

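Because of uncorrelatedness of $\bar{X}^{(r)}$ among $r = 1, \ldots, q$, we have:

$$\gamma_r^0 = \frac{\mathrm{Cov}(Y, \bar{X}^{(r)})}{\mathrm{Var}(\bar{X}^{(r)})} = |G_r|^{-1} \frac{\sum_{j \in G_r} \mathrm{Cov}(Y, X^{(j)})}{\mathrm{Var}(\bar{X}^{(r)})} = |G_r| \frac{\sum_{j \in G_r} \beta_j^0 \left( \mathrm{Var}(X^{(j)}) + \sum_{i \in G_r, i \neq j} \mathrm{Cov}(X^{(i)}, X^{(j)}) \right)}{\sum_{\ell \in G_r} \left( \mathrm{Var}(X^{(\ell)}) + \sum_{i \in G_r, i \neq \ell} \mathrm{Cov}(X^{(i)}, X^{(\ell)}) \right)}.$$

Define

$$w_j = \frac{\mathrm{Var}(X^{(j)}) + \sum_{i \in G_r, i \neq j} \mathrm{Cov}(X^{(i)}, X^{(j)})}{\sum_{\ell \in G_r} \left( \mathrm{Var}(X^{(\ell)}) + \sum_{i \in G_r, i \neq \ell} \mathrm{Cov}(X^{(i)}, X^{(\ell)}) \right)}.$$

Then, $\sum_{j \in G_r} w_j = 1$ and

$$\gamma_r^0 = |G_r| \sum_{j \in G_r} w_j \beta_j^0. \qquad (32)$$

The statement 1 follows immediately from (32). Regarding statement 2, we read off
from (32) that $w_j = |G_r|^{-1}$ for all $j \in G_r$ and hence $\gamma_r^0 = \sum_{j \in G_r} \beta_j^0$. □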
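The weighted-sum representation of the coefficient for a cluster representative can be checked numerically at the population level. The sketch below assumes a block-diagonal covariance matrix (uncorrelated clusters) with arbitrarily chosen entries, takes the representative to be the cluster mean, and uses hypothetical function names. It compares the least-squares coefficients of $Y$ on the cluster means with $|G_r| \sum_{j \in G_r} w_j \beta_j^0$.

```python
import numpy as np

def gamma_via_weights(cov, beta0, group):
    """Evaluate |G_r| * sum_j w_j beta_j^0 with w_j read off from the covariance block."""
    block = cov[np.ix_(group, group)]
    w = block.sum(axis=0) / block.sum()   # column sums: Var(X_j) + sum_{i != j} Cov(X_i, X_j)
    return len(group) * float(w @ beta0[group]), w

def gamma_population(cov, beta0, clusters):
    """Population least-squares coefficients of Y on the cluster means."""
    a = np.zeros((cov.shape[0], len(clusters)))
    for r, g in enumerate(clusters):
        a[g, r] = 1.0 / len(g)            # averaging matrix: representatives are a.T @ X
    return np.linalg.solve(a.T @ cov @ a, a.T @ cov @ beta0)

clusters = [[0, 1, 2], [3, 4]]
cov = np.zeros((5, 5))                    # assumed toy covariance, block diagonal
cov[:3, :3] = [[1.0, 0.5, 0.2], [0.5, 2.0, 0.3], [0.2, 0.3, 1.5]]
cov[3:, 3:] = [[1.0, 0.9], [0.9, 1.0]]
beta0 = np.array([1.0, -0.5, 2.0, 0.0, 3.0])

gamma = gamma_population(cov, beta0, clusters)
gamma_w = [gamma_via_weights(cov, beta0, g)[0] for g in clusters]
weights_second = gamma_via_weights(cov, beta0, clusters[1])[1]
```

In the second cluster the variances and the within-cluster correlation are equal, so both weights are $1/2$ and the coefficient reduces to the plain sum of the two regression coefficients.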



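7.6 Proof of Proposition 4.4

We have

$$\gamma_r^0 = \frac{\mathrm{Cov}(Y, \bar{X}^{(r)} | \{\bar{X}^{(\ell)};\ \ell \neq r\})}{\mathrm{Var}(\bar{X}^{(r)} | \{\bar{X}^{(\ell)};\ \ell \neq r\})},$$

since for Gaussian distributions, partial covariances equal conditional covariances (cf. Baba
et al., 2004). For the numerator, we have:

$$\begin{aligned}
\mathrm{Cov}(Y, \bar{X}^{(r)} | \{\bar{X}^{(\ell)};\ \ell \neq r\}) &= |G_r|^{-1} \sum_{j \in G_r} \mathrm{Cov}(Y, X^{(j)} | \{\bar{X}^{(\ell)};\ \ell \neq r\}) \\
&= |G_r|^{-1} \sum_{i,j \in G_r} \beta_i^0 \mathrm{Cov}(X^{(i)}, X^{(j)} | \{\bar{X}^{(\ell)};\ \ell \neq r\}) \\
&\quad + |G_r|^{-1} \sum_{j \in G_r} \sum_{i \notin G_r} \beta_i^0 \mathrm{Cov}(X^{(i)}, X^{(j)} | \{\bar{X}^{(\ell)};\ \ell \neq r\}) \\
&= |G_r|^{-1} \sum_{i,j \in G_r} \beta_i^0 \mathrm{Cov}(X^{(i)}, X^{(j)} | \{\bar{X}^{(\ell)};\ \ell \neq r\}) + \Gamma_r, \quad |\Gamma_r| \le \|\beta^0\|_1 \nu.
\end{aligned}$$

For the denominator, we have:

$$\mathrm{Var}(\bar{X}^{(r)} | \{\bar{X}^{(\ell)};\ \ell \neq r\}) = |G_r|^{-2} \sum_{i,j \in G_r} \mathrm{Cov}(X^{(i)}, X^{(j)} | \{\bar{X}^{(\ell)};\ \ell \neq r\}).$$

Defining

$$w_j = \frac{\sum_{i \in G_r} \mathrm{Cov}(X^{(i)}, X^{(j)} | \{\bar{X}^{(\ell)};\ \ell \neq r\})}{\sum_{i,\ell' \in G_r} \mathrm{Cov}(X^{(i)}, X^{(\ell')} | \{\bar{X}^{(\ell)};\ \ell \neq r\})},$$

we obtain $\gamma_r^0 = |G_r| \sum_{j \in G_r} w_j \beta_j^0 + \Delta_r$ with $|\Delta_r| \le \|\beta^0\|_1 \nu / C$. The other statement
follows using (33) and as statement 1 in Proposition 4.3. □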



7.7 Proof of Proposition 4.5

Write

$$Y = X^T \gamma^0 + \eta = U^T \beta^0 + \varepsilon = X^T \beta^0 - W^T \beta^0 + \varepsilon.$$

Therefore,

$$X^T (\beta^0 - \gamma^0) = \eta - \varepsilon + W^T \beta^0.$$

Taking the squares and expectation on both sides,

$$(\beta^0 - \gamma^0)^T \mathrm{Cov}(X) (\beta^0 - \gamma^0) = \mathbb{E}[(\eta - \varepsilon)^2] + \mathbb{E}|W^T \beta^0|^2 = B^2 + \mathbb{E}|W^T \beta^0|^2 \le 2 \mathbb{E}|W^T \beta^0|^2,$$

where the last inequality follows from (22). Since $\mathrm{Cov}(X) = \mathrm{Cov}(U) + \mathrm{Cov}(W)$, we
have that $\lambda^2_{\min}(\mathrm{Cov}(X)) \ge \lambda^2_{\min}(\mathrm{Cov}(U))$. This completes the proof. □